Introduction to “Origins of Human 
Language: Continuities and Discontinuities 
with Nonhuman Primates” 


A number of contributions in recent years have addressed language
origins from various points of view. In this book, we aim to help
establish the state of the art of knowledge about the continuities and
ruptures between communication in primates and language in humans. A
major strength of the present book is that it explores a diversity of
perspectives on the origins of language, including the description of vocal
communication in animals, mainly in monkeys and apes, but also in birds; the
study of vocal tract anatomy and cortical control of vocal production in
monkeys and apes; the description of combinatory structures and their
social and communicative value; and the exploration of the cognitive
environment in which language may have emerged from nonhuman primate
vocal or gestural communication. Interestingly, this portrait emerges from
a situation in which one long-standing hypothesis, stating that a low larynx
position was a prerequisite for the emergence of speech, has been clearly
discarded. Indeed, some contributors to this book have recently participated
in two papers showing that the monkey vocal tract is “speech ready”
(Boé et al., 2017; Fitch et al., 2016). This makes the debates clearer, in
that neurocognitive and social evolution now unequivocally appear as the
major potential sources of evolution towards language. The series of eleven
chapters provides a rather complete portrait of, and elaboration on, the facts,
proposals, arguments and claims that pave the scientific path from animal
communication to human language. 

The book begins with a descriptive analysis of baboon calls by Caralyn
Kemp, Arnaud Rey, Thierry Legou, Louis-Jean Boé, Frédéric Berthommier,
Yannick Becker and Joël Fagot. In their study of the “Vocal Repertoire of
Captive Guinea Baboons (Papio papio)”, the authors provide ethograms
and a prototypical description of twelve kinds of vocalizations emerging
from the analysis of individual calls and call sequences in the vocal
repertoire of a group of captive Guinea baboons. Typical sound examples of
each type of vocalization are also provided in Supplementary Materials. 
This study will be of substantial value for students of primate vocaliza- 
tions. More importantly in the context of the present book, it provides a 
concrete and significant example of the “phonetic” description of the vocal 
communication system in nonhuman primates, which contributes to the 
documentation of the precursors of human speech, possibly shedding light on the
conditions of its emergence. Of importance here is the fact that exploitation 
of variations in various dimensions of the vocalizations appears as a possi- 
ble way to increase the efficiency of communication without expanding the 
vocabulary of available units. Interestingly, the large co-variations between 
formants and fundamental frequency also suggest a non-independent mas- 
tery of vocal source and vocal tract configuration in baboons’ vocalizations. 

The next chapter continues from the previous one, zooming in
on one of the twelve baboon vocalizations. Louis-Jean Boé, Thomas
R. Sawallis, Joël Fagot and Frédéric Berthommier ask “What’s up with 
Wahoo? Exploring Baboon Vocalizations with Speech Science Techniques”. 
Focusing on the “wahoo” vocalization, they analyze a corpus of 69 ut- 
terances of wahoo calls coming from the corpus of the previous chapter. 
Careful spectral analysis of these utterances provides major spectral peaks 
separately for the three proto-components {w}, {a} and {hoo}. These peaks 
are compared with those of a [wa.u] phonetic sequence uttered by a human 
speaker in various phonatory modes. In parallel, the authors propose an 
articulatory analysis of a film showing a baboon uttering a wahoo vocalization.
Altogether, they claim that these combined acoustic and articulatory
analyses converge towards the assumption that baboon “wahoo” is rather
similar to a human phonetic sequence that can be transcribed as [wa.u],
with a first syllable chaining a back rounded semi-consonant /w/ and a front
open /a/ produced in an ingressive way, and a back rounded /u/ produced
in an egressive way. 

The exploration of vocalizations in nonhuman primates continues with 
Adriano R. Lameira proposing a view on “Origins of vowels and con- 
sonants: Articulatory continuities with nonhuman great apes”. From his 
study of the call repertoire of orangutans, the author introduces the idea 
that there could exist an articulatory homology between voiceless calls and 
human consonants on the one hand, and between voiced calls and human 
vowels on the other hand. Among the set of voiceless calls, Lameira fo- 
cuses on whistles and shows clear learning abilities in captive orangutans, 
which relates to a number of reports of learning processes in other great 
apes. Concerning voiced calls, Lameira displays kinds of “babbling” vo- 
calizations with rhythmic jaw movements similar to the ones displayed by 
infants, together with imitation games in which a captive orangutan is able 
to modify fundamental frequency in response to modulations of a human 
tutor. These plastic voiceless and voiced vocalizations could provide in the 
author’s view “proto-consonants and proto-vowels” in a kind of language 
precursor in a human ancestor. 

Importantly, vocalizations in nonhuman primates are constrained by the 
anatomy of the orofacial system. This is at the core of the contribution by 
Frédéric Berthommier, Louis-Jean Boé, Adrien Meguerditchian, Thomas R. 
Sawallis and Guillaume Captier dealing with “Comparative Anatomy of 
the Baboon and Human Vocal Tracts: Renewal of Methods, Data, and
Hypotheses”. This comparative anatomy aggregates a series of invaluable data
that enable qualitative and quantitative comparison of vocal tracts in baboons
and humans. These data include (1) a dissection of two adult Papio papio
heads, enabling detailed description of the vocal tract, the larynx and the
tongue musculature, (2) fifty-six 3D MRI scans of Papio anubis baboons
from 2 years to adulthood, enabling the authors to elaborate precise vocal tract
biometry, and (3) radiographic data for 127 human children from 3 to 7.5 years,
providing reference human biometry for comparison with the preceding set
of baboon data. This enables the authors to claim that the hyoid bone would
be placed one vertebra lower in human infants than in adult baboons, and
one additional vertebra lower in male human adults. The increase in
the pharyngeal part of the vocal tract in humans would be accompanied
by compensatory facial shortening, thus keeping the vocal tract length
similar in both species. On the basis of these data, the authors address the issue 
of how exaptation of articulatory patterns in feeding could have contributed 
to structure the articulation of speech sounds. 

Louis-Jean Boé, Joël Fagot, Pascal Perrier, Jean-Luc Schwartz

Vocalizations in nonhuman primates also depend, of course, to a large extent
on the cortical and sub-cortical networks available for vocal and orofacial
control. The question of cortical control is explored in the next two chapters.
Firstly, Veena Kumar and Kristina Simonyan discuss in great detail the
“Evolution of the laryngeal motor cortex for speech production”. Their starting
point is that, as already discussed in the first chapter by Kemp and colleagues,
laryngeal control seems much more precise and stable in humans. Kumar
and Simonyan analyze possible differences in laryngeal cortical control
between humans and nonhuman primates. They first recapitulate several
studies from their group leading to the conclusion that, while laryngeal
motor control would be localized both in the primary motor cortex and in
the premotor cortex in humans, localization would be restricted to the
premotor cortex in apes and monkeys. Their hypothesis is that the premotor
cortex would be sufficient for basic functions associated with, e.g., breathing
or physical effort, but that fine control in humans would require the
additional involvement of the primary motor cortex. This evolution would 
be combined with the emergence in humans of direct cortical connections 
towards the brainstem, while they would be indirect in monkeys. Finally, 
the cortical network of connections between the laryngeal motor cortex and 
parietal and temporal regions necessary for learning and control would also 
be much more developed in humans. 

William D. Hopkins then addresses the question of a cortical area that is
potentially crucial for language and often considered a major piece in 
the emergence of language: Broca’s area. His contribution, entitled “Mo- 
tor and Communicative Correlates of the Inferior Frontal Gyrus (Broca’s 
Area) in Chimpanzees”, provides a rich synthesis of various types of com- 
parative data about the Inferior Frontal Gyrus in monkeys, chimpanzees 
and humans. Firstly, he provides a detailed analysis of the literature on the 
morphology and cytoarchitectonics of Broca’s area in primates and par- 
ticularly in the species his group studied most, that is chimpanzees. While 
the Pars Opercularis (ParsO) and Pars Triangularis (ParsT) are difficult to 
define in the Inferior Frontal Lobe in monkeys, ParsO can be rather clearly 
defined in chimpanzees, but defining a ParsT homolog is less clear. Areas 44 
and 45 present large volumetric expansions and more consistent leftward 
asymmetries in humans compared to chimpanzees, together with a larger 
number of synaptic connections. The author then presents a number of 
results from his group displaying consistent correlations in chimpanzees 
between morphological properties of individual Inferior Frontal Gyrus and 
behavioral abilities associated with communicative actions (e.g. pointing 
manual gestures and attention-getting vocalizations) and tool-use. These 
correlations seem partly genetically heritable. He concludes by discussing 
the implications of these findings in theories of language emergence. 

The next two chapters explore the way vocalizations could indeed con- 
stitute a real communication system likely to open the road towards human 
oral language. Firstly, Camille Coye, Simon Townsend and Alban Lemasson 
discuss the question of combination and compositionality, in their chapter 
entitled: “From animal communication to linguistics and back: insight from 
combinatorial abilities in monkeys and birds”. From their analysis of the 
very wide literature on compositionality in bird songs and monkey calls, 
the authors attempt to carefully disentangle what could be a “phonological 
level” in which non-meaningful vocal units would be combined in various 
ways to provide meaningful sequences, and what could be a “morphosyn- 
tactic level”, in which meaningful units would be combined for producing 
larger meaningful structures. They argue that most reports in the literature 
do not provide convincing examples of nonhuman compositionality in one 
of these two strict senses. Then they present some “promising examples” 
of proto-phonology in the composition of flight calls in chestnut-crowned 
babblers (Australian birds), and proto-morphosyntax in the composition of 
meaningful calls both in southern pied babblers (South-African birds) and 
in Diana monkeys from forests in West Africa. Finally, they suggest some 
possible social pressures driving the use of compositionality, in relation 
with the complexity of the social organization, and the habitat constraints 
on communication pushing for complex vocal communication with low 
ambiguity and long-range facilities of use. 

Klaus Zuberbühler then engages in a global evaluation of the ability of
primate vocalizations to constitute “The Primate Roots of Human Speech
and Language”. To this end, he reviews the continuities between nonhuman
primate vocal communication and oral language, but also some 
major limitations that can be traced in these continuities. First, the vocal 
tract seems speech-ready but cortical control is not sufficient to master the 
vocal source and the vocal learning and combination processes required in 
human speech. Second, the communicative content of the calls seems likely 
to be interpreted and even modulated by monkeys and apes in relation to 
context and audience. However, vocal call exchanges appear to convey low
levels of intentionality (in reference to Dennett’s scale), and nonhuman
primates could lack the ability to share intentions and goals. The author also
addresses the question of referential communication, a crucial component 
of human language. Monkeys and apes do display elements of referential 
communication, but rather focused on themselves. The lack of a clear view of 
the nature of their “mental concepts” sets severe limitations on establishing 
strong links with human language. 

At this stage, where the focus throughout this book has been mostly on vocal
communication, the contribution by Katja Liebal provides a timely and
important comparative overview of “What gestures of nonhuman primates
can (and cannot) tell us about language evolution”. She begins with a review
of the arguments for and against vocal, gestural or orofacial communication
as the possible unique precursor of human language, and she 
nicely shows that arguments in favor of one or the other are often partly 
incomplete or in some sense partial, and hence that no “unique precursor” 
theory is wholly convincing at this stage. Then, she focuses on what could 
be gained for a theory of language evolution by looking at gestural com- 
munication in monkeys and apes. Interestingly, this provides a number of 
echoes to the previous chapter by Zuberbühler, by discussing what aspects 
of gestural communication could display some continuity with human lan- 
guage. Intentionality is a basic component of communicative gestures, with 
clear evidence that both monkeys and apes monitor the attention of their 
partner and modulate communication accordingly. Flexibility — the ability 
to vary the context of use of a given stimulus — seems rather larger for ges- 
tures than for calls or orofacial productions. Gesture compositionality ap- 
pears rather weak, with only one or two possible examples in the literature. 
Referentiality and iconicity are debated. Altogether, the author remains
cautious regarding the gestural vs vocal origin of human language. 

The last two chapters open the angle of view even more widely by ad- 
dressing the question of the cognitive environment required for the emer- 
gence of language. Tecumseh Fitch focuses on “Dendrophilia and the 
Evolution of Syntax”. Syntax is classically considered as a highly specific 
property of human language, and Fitch continues his exploration of the 
specific cognitive abilities that make humans special and could trace a 
major discontinuity in the emergence of language. He introduces the as- 
sumption that this ultimate human cognitive ability consists in the capacity 
to manipulate “supra-regular” grammars, thanks to a structural working 
memory providing generalization and elaborating hierarchies. This is what 
he calls “dendrophilia”, a tendency to organize sensory flows into tree-like
structures. The author reviews experimental data in which various
animal species have been claimed to manipulate grammar-like structures.
He raises objections to each of these studies, arguing that humans are the
only species able to manipulate supra-regular grammars. He concludes with
the likely involvement of Broca’s area in the neural implementation of this 
uniquely human process. 

Finally, Joël Fagot, Raphaëlle Malassis, Tiphaine Medam and Marie
Montant adopt the inverse perspective by “Comparing human and nonhuman
animal performance on domain-general functions: towards a multiple
bottlenecks scenario of language evolution”. They propose an alternative to
the search for a uniquely human capacity, and rather explore possible
continuities and discontinuities in general cognitive abilities. They successively 
analyze integration in time and space, integration across sensory dimensions 
and sensory modalities, and various types of categorization processes. In 
each of these domains, they document resemblances between animals and 
humans, and aspects in which humans display a specific behavior. Humans 
appear better at processing and learning complex sequences, at extract- 
ing global aspects of visual scenes, at integrating sensory dimensions, at 
extrapolating perceptual properties in equivalence classes and elaborating 
qualitative rules and generalizing these rules across domains. This results in 
various types of “bottlenecks” that could have constrained the emergence 
of language. The authors conclude on the specific importance of attention 
and working memory in the bundle of factors that seem to have co-evolved 
in the route towards human language. 

Although it is not exhaustive, we hope that the tour offered in this book
will convey a clear sense of the progress that has been made in the field
of language evolution, and that the book will serve as a resource
for students and researchers in the field. We would like to thank all the
contributors for their chapters. 
Vocal Repertoire of Captive Guinea Baboons 
(Papio papio) 


Abstract: In order to study the evolution of language, it is useful to understand the 
communicative systems of nonhuman animals. To this end, descriptive ethograms 
of primate vocal repertoires are the ideal starting point. We examined the vocal 
repertoire of a group of captive Guinea baboons (Papio papio). Twelve vocalizations 
were readily distinguishable using individual call components and call sequences. 
Some of these vocalizations were sex and/or age specific (e.g., copulation grunts 
in females, moans in infants). We compared these vocalizations to those reported 
in wild Guinea baboons as well as the other baboon taxa. The Guinea baboons 
share the basic call units with the other baboon species. However, a large degree 
of variability occurs within call sequences (e.g. number of grunts within a bout, F0 
and calling rate [number of grunts/second]). The baboons also showed vocal vari- 
ability through the combination of different vocalizations (e.g. moans, screams and 
yaks in varying order and number within a bout) and the use of one vocalization 
(barks) in a new captive-specific context. The present study complements recent 
studies on the vocal productions of baboons, and opens several new perspectives 
on the evolution of language. 


Keywords: vocal repertoire, baboons, primate vocalizations, language 


1. Introduction 


The evolution of speech from simpler primate communication may
have been a pivotal transition for our species (Smith and Szathmary, 1995;
Snowdon, 2004). However, evidence for how this occurred is scarce, with 
only a handful of features which define language, such as rudimentary 
forms of syntax (Ouattara et al., 2009), found in the vocalizations of some 
primates. It is important to point out, though, that only a small proportion 
of primate species have had their vocal repertoire described and analyzed 
(Zuberbühler, 2012). Ethograms are the first step towards better under- 
standing these vocal systems. They can provide the basis for comparative 
studies, and are especially useful for newcomers to the species and those 
who work closely with the animals (Fischer and Hammerschmidt, 2002). 
Careful analysis of vocal repertoires in nonhuman primates also provides 
the groundwork for systematically tracking the development of more com- 
plex vocal systems. Here we present the findings of a study on the vocal 
repertoire of a group of captive Guinea baboons (Papio papio), from which 
it was possible to determine their ability to produce vowel-like sounds (Boé 
et al., 2017). 

The description of the vocal repertoire of baboons has a complex his- 
tory due to the wealth of terminology used between and within taxa and 
researchers, and to indecision regarding species or sub-species status of 
this primate group. Regarding the latter, the so-called savannah species 
(Guinea: P. papio, Olive: P. anubis, Yellow: P. cynocephalus, Chacma: P. 
ursinus, Kinda: P. kindae; Hayes et al., 1990) have generally been con- 
sidered to be relatively homologous subspecies with similar vocalizations 
while the hamadryas baboon (P. hamadryas) has been considered, and 
thus studied, separately as a full species with its own unique vocaliza- 
tions (e.g., Estes, 1992). Recent genetic evidence suggests that the taxa 
should be considered as phylogenetic species or biological subspecies. 
Furthermore, this research has shown that hamadryas baboons have not 
greatly diverged from the other taxa and share genetic and physical char- 
acteristics with Guinea baboons (Newman et al., 2004). Vocalizations are 
particularly sensitive to the process of speciation (Lanyon, 1969) and their 
study may serve to provide additional information for baboon systemat- 
ics. However, while there is a large degree of similarity in the vocalizations 
between baboon taxa (Maciej et al., 2013a), not all vocalizations seem to 
occur in all species (Estes, 1992). 

The wild studies by Byrne (1981) and Maciej et al. (2013a) comprise the 
only published reports on the vocal repertoire of Guinea baboons, although 
notes and some analyses of individual vocalizations have also been made 
by Anderson and McGrew (1984), Andrew (1962), Maciej et al. (2013b) 
and Maestripieri et al. (2005). From these studies, we can determine that 
vocalizations seem to be important for this species to maintain contact but 
also to warn off predators (Anderson and McGrew, 1984; Byrne, 1981). 
However, these studies have limitations. Byrne’s ethogram did not include 
spectrograms or fine-detailed descriptions of all the vocalizations, Maciej 
et al. (2013a) did not include the vocalizations of juveniles and infants, 
and a variety of terminology has been used throughout the literature; this 
can make it difficult to compare the vocalizations or even determine how 
a particular call sounds. 

Analyses of the vocal repertoires of primates have been conducted by 
ear (e.g. Byrne, 1981), or using temporal and frequency measures of indi- 
vidual calls and bouts of calls from spectrograms (e.g. Bermejo & Omedes, 
1999; Fischer and Hammerschmidt, 2002), and, more recently, cluster and 
principal component analysis (e.g. Gros-Louis et al., 2008; Maciej et al., 
2013a). Using discriminant function analysis, Bezerra et al. (2010) showed 
that subjective differentiation of vocalizations — that is, by audible and 
visual inspection — is relatively reliable. Commonly considered structural 
parameters of vocalizations in these analyses include duration, frequency 
range, modulation, harmonics, and noise. 

The aim of our study is to identify the full range of vocalizations pro- 
duced by captive Guinea baboons and provide descriptions of each. After a 
first presentation of the general principles of our methods, we report below 
an overview of the different vocalizations in three sections: ‘Acoustic de- 
scription’ details the basic features of the vocalization, including variability; 
‘Context & usage’ defines how and when the vocalization was used; the 
‘Terminology’ section lists synonymous vocalizations and their terminology 
throughout the literature. We then present the results of the formant analyses 
which were conducted on several categories of vocalizations, including the 
grunts, barks, wa- (of wahoo), -hoo (of wahoo), yaks, and copulation calls. 
In our discussion, we will show that such detailed descriptions of the vocal 
repertoire of a nonhuman primate species open interesting perspectives on 
the evolution of language. Appendix 1 provides a glossary of the key terms 
used in this chapter. 
2. Methods 


This research adhered to the legal requirements of France and to the Ameri- 
can Society of Primatologists Principles for the Ethical Treatment of Non- 
Human Primates. 


2.1 Subjects and Housing 


We recorded the vocal behavior of 31 Guinea baboons (12 males, 19 
females, aged between 2 months and 27 years at the start of this study; 
Appendix 2) which are maintained within three groups at the Rousset- 
sur-Arc Centre National de la Recherche Scientifique (CNRS) primate 
center, France (see Fagot et al., 2014 for housing details). This center also 
houses olive baboons, which are within auditory but not visual range of 
our subjects. 


2.2 Recording of Vocalizations 


We recorded the vocalizations of the baboons from September 2012 to June 
2013, with the behavior, social interactions and context noted. Ad libitum 
opportunistic sampling techniques of spontaneous vocalizations, which 
included social events and responses to stimuli occurring naturally within 
their environment (e.g., sheep [Ovis aries] passing next to the center), were 
used to record over 1000 vocalizations. The baboons were accustomed 
to humans standing and walking by the fence of their enclosures and the 
presence of the recorders and their equipment did not disturb the baboons 
from their natural daily activities. 

Recording took place between 8:00 and 21:00. We particularly focused 
on the half hour prior to feeding (16:30-17:00) as the baboons were more 
vocal, and more consistently vocal, during this time. Recordings did not 
occur between 17:00 and 18:00 when the baboons were eating, so as to 
avoid potential distortion of the vocalizations due to chewing and full cheek 
pouches. Digital Zoom Handy Recorders (H4n) with a Sennheiser ME66
microphone were used to record the vocalizations. This is a supercardioid
microphone with a high sensitivity (50 mV/Pa ± 2.5 dB) and a wide (40 Hz to
23 kHz) and flat (± 2.5 dB) frequency response. Recordings were made at a
distance of 1 m to 20 m from the baboons, with longer distances suitable 
only for the long-distance vocalizations. 


2.3 Vocalization Analysis 


After discarding recordings in which the caller could not be identified, or
which had a poor signal-to-noise ratio because of disruptive background noise
or overlapping vocalizations, we created a database of over
1000 vocalizations. Male and female vocal productions were separated
in the adult and sub-adult classifications, but were combined for juveniles
and infants. This decision was based on the lack of body size differences
between male and female juveniles and infants, and the similarity in F0.
Vocalizations were then grouped using several methods: by ear, visual
inspection using spectrograms, broad descriptive features, and detailed
formant analysis. A minimum of 10 recordings per vocalization were
analyzed for descriptive features. Our analyses focused on the fundamental
frequency (F0), the number of individual call units per vocalization series,
the duration of each call or phase, the duration of the interval between
two calls in the same bout, the total duration of a calling bout, and
formants (F1 and F2). The acoustical analyses of the vocalizations were
performed using Praat 6.0.13 for spectrograms and high F0 vocalizations
(bark, yak) and wahoos. A problem with using Praat to measure F0
is that it relies on the relative periodicity of the speech signal, as
computed from short-term autocorrelation. This program is not suited
to inferring F0 for the grunts, barks and chattering, because these calls
exhibit some irregularities, additive noise, or very low F0 values (< 60
Hz), i.e. long periods. In our study, F0 was inferred for these latter
vocalizations with a custom Matlab script, which computed F0 from
hand tagging of the periodicity of the acoustic signal (see Figure 1 for an
illustration of this procedure). 
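
As a rough illustration of how F0 can be derived from hand-tagged periodicity, the following Python sketch converts a list of tagged cycle onsets into an F0 track by taking the reciprocal of each interval duration. It is not the Matlab script used in this study: the function name `f0_from_period_marks`, the choice of Python, and the example mark times are assumptions made for illustration only.

```python
from statistics import mean


def f0_from_period_marks(mark_times_s):
    """Convert hand-tagged cycle onsets (in seconds) into an F0 track.

    Returns a list of (midpoint_time_s, f0_hz) pairs, one per interval
    between consecutive marks. Hypothetical helper, for illustration only.
    """
    if len(mark_times_s) < 2:
        raise ValueError("need at least two period marks")
    track = []
    for start, end in zip(mark_times_s, mark_times_s[1:]):
        period = end - start
        if period <= 0:
            raise ValueError("period marks must be strictly increasing")
        # F0 is the reciprocal of the duration of one tagged cycle.
        track.append(((start + end) / 2.0, 1.0 / period))
    return track


# Hypothetical marks: five cycle onsets spaced 20 ms apart, i.e. a call
# with F0 near 50 Hz, below the 60 Hz range mentioned in the text.
marks = [0.000, 0.020, 0.040, 0.060, 0.080]
track = f0_from_period_marks(marks)
mean_f0 = mean(f0 for _, f0 in track)
```

Because each F0 value comes from a single tagged interval, this approach tolerates irregular cycles, at the cost of depending on the accuracy of the manual tags.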


20 C. Kemp, A. Rey, T. Legou, L. Boé, F. Berthommier, Y. Becker, J. Fagot 


Figure 1: Illustration of the method used to measure F0 for the grunts, barks
and chatterings. The top panel, which shows the acoustic signal for a
chattering (see the definition of a chattering below), illustrates our hand
tagging of the periodicity of the signal. The bottom panel shows the
corresponding F0, which was calculated with our Matlab script from the
interval durations. 
3. Results 
3.1 Overview 


Twelve distinct vocalizations were distinguishable in the captive Guinea
baboons. Some vocalizations were age and sex specific. Table 1 provides
the full list of vocalizations and their occurrence per age and sex.
Illustrative audio files of each kind of vocalization can be found on the webpage
(https://osf.io/nr2ye/) provided as supplementary material. While broad
descriptive features of vocalizations were useful in grouping calls and
creating distinct categories, formant analysis was only possible for 5 of the 12
vocalizations (grunts, wahoos, barks, yaks, and copulation calls, see below).
We selected the clearest recordings for each vocalization per sex/age group
for analysis of their broad descriptive features (a minimum of 10 separate
recordings was possible). 


Table 1: Defining features of each vocalization within the repertoire of Guinea
baboons. The calls found within each sex-age category are noted as the percentage
recorded, with sample size taken into account. A ‘-’ indicates that this
characteristic was not applicable or measurable for that vocalization. ~ indicates
that the vocalization was observed in this category but we did not record it.
Note that no two vocalizations have the same characteristics.
[Table body not recoverable; legible column headings: Vocalization, Sex-age
category, Phases, Number of calls, Interval, duration, Formants.]


3.2 Vocalizations produced by all or most age and sex categories 
3.2.1 Rhythmic grunts 


Acoustic description (see Figure 2): This tonal vocalization is characterized
by the presence of multiple, single-phase calls of even temporal spacing,
with clear formants (although sometimes only one formant could be detected,
particularly when produced by adult males). F0 was low and sometimes
changed within a bout, but otherwise grunts were acoustically stable in their
physical structure within the same bout. Grunt bouts did not vary much
between contexts, although faster calling rates were found in contexts 3 and
4 (see below, ‘Context & usage’). Calling rates were around 2.2 grunts/s in
adults and sub-adults, 1.8 grunts/s in juveniles, and 1.01 grunts/s in infants.
Infant grunts showed physical differences from adult and even juvenile
grunts, with the loudness and F0 being much higher. 


Context & usage: Rhythmic grunts were the most common affiliative vo- 
calizations and were used by all age-sex groups in nine main contexts: 
1) towards infants to elicit interaction, 2) towards mothers with infants, 
3) after an infant scream, 4) by an individual, not the mother, usually an 
adult male, holding an infant to its chest, sometimes bouncing it, 5) between 
hugging adults, 6) by males eliciting a female to copulate, 7) by males after 
copulation, 8) from dominant animals (or males) to lower ranked females 
when approaching to groom or sit close by, and 9) by a non-moving group. 
Grunts were almost always produced as a series of calls (bout). Between 2 
and 18 calls per bout were recorded; grunts were considered to belong to 
the same bout when they were less than 1.5s apart. 
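
The bout criterion above can be illustrated with a short sketch. This is not the authors' procedure, only a minimal Python illustration under stated assumptions: calls are given as hypothetical (onset, offset) pairs in seconds, and "apart" is taken to mean the silence between the end of one call and the start of the next. The function name `group_into_bouts` is ours.

```python
def group_into_bouts(calls, max_gap_s=1.5):
    """Group calls into bouts.

    `calls` is a list of (onset_s, offset_s) tuples (hypothetical input format).
    Assumption: two consecutive calls belong to the same bout when the silence
    between the offset of the first and the onset of the second is shorter
    than `max_gap_s` (1.5 s in the text).
    """
    bouts = []
    for onset, offset in sorted(calls):
        previous_offset = bouts[-1][-1][1] if bouts else None
        if previous_offset is not None and onset - previous_offset < max_gap_s:
            bouts[-1].append((onset, offset))   # continue the current bout
        else:
            bouts.append([(onset, offset)])     # start a new bout
    return bouts


# Synthetic example: three grunts in quick succession, a pause, then two more.
example_calls = [(0.0, 0.1), (0.5, 0.6), (1.0, 1.1), (3.0, 3.1), (3.4, 3.5)]
for bout in group_into_bouts(example_calls):
    print(len(bout), "calls from", bout[0][0], "s to", bout[-1][1], "s")
```

If the gap were instead measured onset to onset, only the subtraction inside the loop would change.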

The grunts were soft and therefore used only as a short-distance contact
call. The production of grunts by one individual typically did not elicit
grunts from other individuals, although this did occur, specifically in
contexts 4 and 9. The function of this grunting could not be determined. In
the latter situation, several individuals were sitting within a meter of each
other, looking in different directions; they were typically not physically
interacting. One individual would begin grunting and the others would then
join in. Infants only grunted in response to adult grunts. Grunts were
produced with the mouth almost closed; the ears twitched backwards and the
eyebrows were raised with each call.

Figure 2: Rhythmic grunt of an adult female. Audio signal (top panel), wide-band
spectrogram (Praat) showing the first two formants as well as the
characteristic vertical lines due to low F0 periodicity (middle panel), and
F0 computed with our Matlab scripts (bottom panel).

Terminology: This section lists the terminology used within the baboon
taxa literature that, based on descriptions or spectrograms, appears to
correspond to the vocalization described here. Grunts (P. papio: Byrne, 1981;
P. ursinus: Cheney and Seyfarth, 2007; Rendall et al., 2005; P. cynocephalus:
Hall and DeVore, 1965; P. anubis: Ey and Fischer, 2011), rapid grunt
(P. papio: Byrne, 1981), rhythmic grunts (P. hamadryas: Ransom, 1981;
Smuts, 1985), basic grunt (P. anubis: Ransom, 1981; Smuts, 1985), broken
grunting (P. anubis: Ransom, 1981), low amplitude grunt (P. ursinus: Engh
et al., 2006), soft grunts (P. papio: Anderson and McGrew, 1984).


3.2.2 Barks 


Barks were recorded in two main contexts: prior to feeding, in response to
the presence of humans (contact barks), and in response to the presence of
sheep (alarm barks). The two types were not distinguishable by ear, but the
analysis did reveal differences in their acoustic structure. The difference
between contact and alarm barks is described below. Barks also occurred in
two other, rarer, instances. In the first, one infant (Grimm) produced barks
after his mother was removed from the troop due to illness. We also deemed
these barks a form of contact (see Cheney et al., 1996), but they were not
included in the analysis of the contact barks. The second instance was in
response to the alarm wahoos of the nearby olive baboons. As it was not
possible within the constraints of this study to determine if our subject
group could distinguish between the contact and alarm wahoos of the olive
baboons, we did not categorize their response barks as either contact or
alarm. Only one adult male, Articho, produced barks (in both the contact and
alarm contexts), but this was rare; the vocalization was more typically
produced by females, juveniles and even one infant (Grimm). Barks were
produced with the mouth rounded into an ‘O’ shape.


• Contact barks


Acoustic description (see Figure 3): This bark is sharp and clear, with a
defined and modulated harmonic structure, and lower signal-to-noise ratio
than observed in alarm barks. The F0 of contact barks typically followed
a curved temporal pattern, rising in frequency (Hz) from the start of the
call before returning to the starting frequency; this curved feature was less
pronounced in adult male barks. The barks produced by Grimm after the
removal of his mother were shorter (0.12 ± 0.01 s) than those he produced
prior to feeding (0.18 ± 0.01 s), and the F0 was similar.

Figure 3: Contact barks. These five contact barks were produced by an adult male,
sub-adult male, sub-adult female, juvenile and infant, respectively. Audio
signal (top panel), wide-band spectrogram (Praat) showing the first two
formants (middle panel), and F0 calculated with Praat (bottom panel).

Context & usage: Contact barks were largely produced in the hour prior
to feeding when the baboons observed humans leaving the office complex
nearby and when the humans were preparing the food. The baboons visually
fixated on staff when producing these barks. Barks by one individual
could elicit barks in others to create a chorus.


Terminology: Clear bark (Papio ursinus: Ey and Fischer, 2011; Fischer et al., 
2001b), dog-like bark (all savannah baboon species: Estes, 1992; P. cynocepha- 
lus: Hall and DeVore, 1965), contact bark (Papio ursinus: Cheney et al., 1996; 
Ey and Fischer, 2011; Fischer et al., 2001a), sharp bark (P. papio: Byrne, 1981). 


• Alarm barks


Acoustic description (see Figure 4): Apart from their higher formants, alarm
barks have much the same general acoustic structure as the contact barks
described above; however, this bark type is noisier and less tonal than the
contact barks.

Figure 4: Alarm barks. Audio signal (top panel), wide-band spectrogram (Praat)
showing the first two formants and characteristic vertical lines due to
low F0 periodicity (middle panel), and F0 calculated with Praat (bottom
panel). These three alarm barks were produced by an adult female, a
sub-adult male and a juvenile, respectively.

Context & usage: Alarm barks were produced when the sheep were heard
approaching, grazing next to and passing by the primate center. Single barks
were the norm, although barking bouts (up to 6 barks, with less than 1.5 sec
between calls) were recorded. The baboons visually fixated on the sheep, or
in the direction from which the sheep could be heard approaching, when
barking.


Terminology: Fear bark (P. ursinus: Cheney and Seyfarth, 2007), alarm bark 
(P. ursinus: Cheney and Seyfarth, 2007; P. papio: Byrne, 1981), cough-bark 
(P. anubis: Ransom, 1981), harsh bark (P. ursinus: Fischer et al., 2001a), 
shrill bark (all savannah baboon species: Estes, 1992; P. cynocephalus: Hall 
and DeVore, 1965; P. anubis: Ransom, 1981; Rowell, 1966; P. ursinus: 
Fischer et al., 2001a), sharp bark (P. papio: Byrne, 1981).


3.2.3 Threat Grunts


Acoustic description (see Figure 5): Threat grunts are highly noisy calls,
with harsh but soft rolling egressive cough-like sounds. There were enough
recordings of single call productions to suggest that a single call should be
considered as the vocalization; however, vocal bouts were still common,
although the temporal connection between call units was quite variable. The
F0 of this vocalization is low and unstable within each call, but the
formants are stable. Although individual threat grunts are produced by
sub-adults and adults as a single phase (i.e., continuous production),
juveniles typically gave a seemingly double-phase grunt, as if the sound
hitched during production.


Figure 5: Threat grunts of an adult male. Audio signal (top panel), wide-band
spectrogram (Praat) showing the first two formants and characteristic
vertical lines due to low F0 periodicity (middle panel), and F0 computed
with our Matlab script (bottom panel).

Context & usage: This vocalization was observed in two contexts. The first
of these was in antagonistic situations between adult females, in which the
aggressor produced the vocalization. The second context was in response
to the sheep; all sex-age groups produced this vocalization in this context,
although it was rarer in adult males and infants. Threat grunts were often
observed in conjunction with barks (juveniles, adult females and sub-adult
males) and wahoos (adult males only). Two calls were often produced
within 1.5 sec of each other, followed by a long pause until the next call.


Terminology: Threat grunts (P. ursinus: Cheney and Seyfarth, 2007; Engh 
et al., 2006). 


3.2.4 Yaks 


Acoustic description (see Figure 6): Yaks have an irregular harmonic
structure. The F0 is also modulated, being highly variable within a single
call, with a lower frequency at the beginning of the yak than at the end. This
vocalization is typically produced as a series of high-F0, single-phase calls
with even temporal distribution, although calling rate can increase with
context intensity. Up to 50 yaks in a series were recorded, with calls being
considered as part of the same series when produced less than 1.5 sec apart.


Context and usage: This vocalization was produced by individuals being 
threatened or in distress. The corresponding facial expression involved a 
strong retraction of the lips. It may be that the call is a form of appease- 
ment, as suggested by Estes (1992). It did not appear to act as a recruit- 
ment vocalization. Infants produced yaks when they were rebuffed by their 
mother and were looking for comfort, often in the form of nursing. Yaks 
were produced as a long series of calls, but were also given in conjunction 
with screams and/or moans (infants only) in varying orders and numbers. 
Context suggested that yak-only series were given in lower intensity situa- 
tions, especially in comparison to screams. Yaks were produced by adults 
with the teeth bared and the body often cowed and shoulders hunched, with 
the tail lowered and ears back. Yaks by infants were not given with the same 
body posture; instead, infants were usually running after their mothers.

Figure 6: Yaks of an adult female. Audio signal (top panel), wide-band spectrogram
(Praat) visualizing the harmonics (middle panel), and F0 calculated with
Praat (bottom panel).

Terminology: Yak/yakking (P. cynocephalus: Hall and DeVore, 1965; all
savannah baboon species: Estes, 1992), geck (infants only — P. anubis: Ran- 
som, 1981; P. papio: Anderson and McGrew, 1984; P. hamadryas: Ransom, 
1981; Smuts, 1985), chirplike clicking (infants only — P. cynocephalus: 
Hall and DeVore, 1965), ick (of the ick-ooer, infants only — all savannah 
baboon species: Estes, 1992), fear bark (P. ursinus: Cheney and Seyfarth, 
2007), staccato coughing (P. hamadryas: Kummer, 1968), disjointed cough- 
ing (P. hamadryas: Ransom, 1981; Smuts, 1985), contact call (Rendall et 
al., 2009). 


3.2.5 Screams 


Acoustic description (see Figure 7): Screams were highly variable, probably
the most variable vocalizations produced by the baboons. Calls could have
either harmonics or no clear harmonics, with some recorded instances of
calls alternating between both characteristics. Durations were also
variable, ranging from less than a second (quick yelps) to extended calls of
over 2 s. They could be produced as a single call or as multiple calls within
a bout. The high F0 (~1 kHz) did not allow for formants to be observed. The
maximum frequency observed was very high (approaching 20 kHz). Some
screams (or scream sections) were noisy and harsh with no clear harmonic
structure. Harmonic production could be either clear or mixed with some
noise. Inspection of screams found that the baboons could change F0 quite
rapidly and dramatically within a call. Screams were considered singular
vocalizations that could be produced in bouts. Each call was analyzed
separately.


Figure 7: Screams of a sub-adult male. Audio signal (top panel), wide-band
spectrogram (Praat) showing the harmonics (middle panel), and F0
(bottom panel) calculated with Praat. Note that F0 is too high in this
example for visualizing the formants.

Context & usage: Screams were observed in three main contexts: surprise,
fights and maternal rejections (i.e., produced by infants when their mother
did not allow nursing or clinging). Screams produced when the baboon was
surprised by an event, such as a sudden movement or shock, were more of a
‘yelp’-like sound. Regarding the second context, screams were occasionally
produced by the aggressing individual, but it was far more typical for the
scream to be produced by the individual being aggressed. Screaming from
infants could produce reactions from adults and older juveniles, including
grunting and physical comforting; screams due to maternal rebuffs rarely
elicited a response from other baboons. These screams were strongly
harmonic. Screams were often coupled with yaks and/or moans (infants only)
in various combinations (e.g.
yak-scream-scream-yak-yak-yak-yak-yak-yak-yak-scream-yak-yak-moan).
A single yak often preceded a screaming bout. One sub-adult male baboon
(Cloclo) and one juvenile (Feya) would produce a short scream after a single
bark at feeding. Screams were produced with the teeth bared and the lips
retracted.


Terminology: Scream (P. anubis: Ransom, 1981; P. papio: Byrne, 1981), 
screaming (Hall and DeVore, 1965), screeching (Estes, 1992). 


3.3 Vocalizations produced by adults and sub-adults 
3.3.1 Wahoo 


Wahoos in our population were primarily produced in three contexts: in 
response to the wahoos from P. anubis (contest wahoo), prior to feeding 
in conjunction with barks (contact wahoo) and in response to the sheep 
(alarm wahoo). The contest wahoos were typically produced in low light, 
making identification of the vocalizing individual difficult. As we could not 
be sure in our recordings if any of these vocalizations came from our group 
of Guinea baboons or the nearby olive baboons, they are not included in 
our discussion here. 


• Contact wahoos


Acoustic description (see Figure 8): Wahoos are two-phase, single-call
vocalizations with high and low F0 sequences. As with the contact barks,
these wahoos had a lower signal-to-noise ratio than those produced in the
alarm context. The F0 varies from the ‘wa’ to the ‘hoo’, with the latter
typically produced with a lower F0.

Figure 8: Contact wahoos of sub-adult males. Audio signal of a wahoo (top
panel), wide-band spectrogram (Praat) showing the first two formants
(middle panel), and F0 (bottom panel) calculated with Praat for the wa-,
and with our Matlab program for the -hoo.

Context & usage: Contact wahoos were typically made by sub-adult males,
although occasionally adult females also seemed to give a wahoo instead
of a bark. However, it is important to note that while wahoos from adult
females were often identified by ear, spectrogram analysis showed that
these were more likely to be barks, with the ‘hoo’ sound being a faint
continuation of the exhalation of breath. During production, the mouth was
widely opened in an elongated vertical ‘O’ during the ‘wa’, before closing
to a horizontal opening for the ‘hoo’.


• Alarm wahoos


Acoustic description (see Figure 9): The ‘wa’ of alarm wahoos showed some
similarities with the alarm barks, in that it was tonal with a large degree
of noise. The ‘hoo’ production was distinct and of longer duration in this
context, in comparison to the wahoos produced prior to feeding (Table 2).

Figure 9: Alarm wahoo of an adult male. Audio signal of a wahoo (top panel),
wide-band spectrogram (Praat) for visualizing the first two formants
(middle panel), and F0 (bottom panel) which was computed with our
Matlab program.

Context & usage: Like alarm barks, alarm wahoos were produced in response to
the sound and/or the presence of sheep. Although wahoos are typically
considered a single-call vocalization, a series of three wahoos was
observed on a few occasions, and double wahoos (with the first wahoo
shortened and immediately followed by the second wahoo) were also
recorded.


Terminology: This terminology corresponds to both contact and alarm
wahoos, as little to no differentiation in names has been noted within
the literature. Wahoo bark (P. papio: Byrne, 1981), contact bark
(P. ursinus: Cheney and Seyfarth, 2007; Ey and Fischer, 2011), wa-hoo
(P. anubis: Ransom, 1981), wahoo (P. ursinus: Fischer et al., 2002 [note that
the authors differentiate between ‘contact’, ‘contest’ and ‘alarm’ wahoos
in terminology]; Kitchen et al., 2003), two phase/d bark (all savannah
baboon species: Estes, 1992; P. cynocephalus: Hall and DeVore, 1965;
P. hamadryas: Ransom, 1981; Smuts, 1985), type 2 loud call (P. papio:
Byrne, 1981), loud call (P. ursinus: Fischer et al., 2002; Kitchen et al.,
2003), bahu bark (P. hamadryas: Kummer, 1968), oohu roar (P. hamadryas:
Kummer, 1968).


3.3.2 Roargrunts 


Acoustic description (see Figure 10): This is a series of loud grunts with 
a hum-like grunt typically preceding the call, a pause of 5-6 sec and then 
4-6 grunts produced in close succession as a crescendo, with a final long 
grunt or roar, similar to a double wahoo. The call sequence is variable, with 
the hum difficult to discern or absent, and the concluding roar not always 
produced. We did not record enough of these vocalizations to determine 
why there was so much variability in the entire bout. However, the grunts 
that make up the majority of the vocalization were always present and 
produced consistently. Although we did not analyze this vocalization due 
to the small sample size, we did note that the grunts had a low F0 (~30 Hz),
with F1 typically around 440 Hz and F2 at 1.1 kHz. Each call within the
bout was typically longer (0.34 s) than those produced by the adult males
during rhythmic grunting (0.11 s), although the interval durations were
similar (~0.45 s). The notable features of this vocalization are the long
grunts produced at a slow calling rate (~1.7 grunts/s).


Context & usage: Adult males produced this call either prior to feeding or 
when the sheep were present, suggesting it is in response to high arousal 
level due to tension. 


Terminology: Type 1 loud call (P. papio: Byrne, 1981), roargrunt (P. papio: 
Byrne, 1981; Maciej et al., 2013b; P. anubis: Ransom, 1981; P. hamadryas: 
Ransom, 1981; Smuts, 1985), hum-roargrunt (P. anubis: Ransom, 1971),
roaring (all savannah baboon species: Estes, 1992), crescendo of two-phase
grunts (all savannah baboon species: Estes, 1992), grating roar (P. anubis:
Estes, 1992).

Figure 10: Roargrunt of an adult male. Audio signal (top panel), wide-band
spectrogram (Praat) for visualizing the first two formants (middle
panel), and F0 (bottom panel) computed with our Matlab program.


3.3.3 Male Grunts


Acoustic description (see Figure 11): This vocalization was particularly
difficult to analyze as it was often produced with accompanying acoustic
displays that interfered with the recorded signal (see below, ‘Context &
usage’). Therefore, we only provide spectrograms and audio files (see
supplementary material) of prototypical examples of these calls, and no
analysis was performed. The vocalization consists of a series of rapidly
produced, short (~0.05 sec), breathy, strongly egressed grunts, which form a
crescendo and sometimes end in a roar or double wahoo, similar to that heard
at the end of some roargrunt sequences.

Figure 11: Male grunt of an adult male. Audio signal (top panel), wide-band
spectrogram (Praat) showing the first two formants and the
characteristic vertical lines due to low F0 periodicity (middle panel),
and F0 computed with our Matlab program (bottom panel).

Context & usage: This is an adult and sub-adult male vocalization, albeit
rare in the latter, and is typically performed together with power
displays. These displays include fence shaking, jumping, rock throwing,
and throwing the head back. Both adult and sub-adult males, as well as one
infant (Grimm), one juvenile/sub-adult (Dan) and some adult females, were
observed performing these displays without the vocalization or with only a
subset of the full vocalization, suggesting that the coordination to do both
requires development and strength. This vocalization was produced after
fights with other males, when the sheep were present for long periods of
time, prior to feeding (especially if feeding was delayed), and when the
computer systems (see Fagot and Paleressompoulle, 2009), to which the baboons
usually had access, were blocked. These contexts suggest that this
vocalization is associated with high arousal levels and frustration, as well
as with indicators of male size and strength.

Terminology: We could not accurately determine corresponding vocalizations
in other publications, but this may be the deep grunt described for P. papio
by Byrne (1981).


3.3.4 Two-Phase Grunts 


Acoustic description (see Figure 12): As we only recorded a few instances
of this vocalization, it was not analyzed in any great detail. These grunts
were produced as a series, with two grunts paired, i.e., in close temporal
proximity with a short interval (~0.04 s), then a longer interval
(~0.15 s) before the next pair. Duration of each pair was 0.4 s, with
the first grunt longer (~0.27 s) than the second (~0.13 s). It is therefore
recommended that the grunt be analyzed like the wahoo, and considered,
as the name suggests, as a two-part call. Due to the way this
vocalization is produced (see below, ‘Context & usage’), it is likely to be
dismissed as panting. However, the clear formant structure (F1 = ~350 Hz,
F2 = ~2 kHz) and controlled production indicate that this is a vocalization
and not a consequence of running. Bouts were long (between 11 and 18 grunt
pairs), with the first grunt typically having a higher F0 (~60 Hz) than the
second grunt (~50 Hz) within each pair.


Context & usage: Two-phase grunts are ingressive-egressive vocalizations,
similar to panting. They were only produced by adult males, in contrast to the
study by Byrne (1981), who found that all age-sex classes except infants
produced this vocalization. The males were observed making this call while
being chased by other adult males during fights.


Terminology: Two-phase grunts (P. papio: Byrne, 1981), pant-grunt 
(P. anubis: Ransom, 1971), uh-huh (all savannah baboon species: Estes,
1992; P. cynocephalus: Hall and DeVore, 1965), grunting (all savannah
baboon species: Estes, 1992).

Figure 12: Two-phase grunts of an adult male. Audio signal (top panel), wide-band
spectrogram (Praat) showing the first two formants (middle panel), and
F0 (bottom panel) computed with our Matlab program.


3.3.5 Copulation Calls


Acoustic description (see Figure 13): Copulation calls are defined by the
production of a series of grunt calls (never singular), with fluctuating speed
and F0. Egressed grunt-like breaths without formants were occasionally
interspersed between the true grunts and/or at the end of the series.
Copulation calls were typically tonal.

Figure 13: Copulation calls produced by adult females. Audio signal (top panel),
wide-band spectrogram (Praat) showing the first two formants (middle
panel), and F0 (bottom panel) calculated with Praat.

Context & usage: Adult, sub-adult and even juvenile (rare) females produced
this vocalization toward the end of copulation, completing the call
while running away from their male partner. It was also observed in one
adult female (Mona) while being mounted by other females. The vocalization
was preceded by a distinctive facial expression, in which the mouth was
rounded into an ‘O’ shape, with the lips slightly pursed. Not all copulations
were followed by copulation calls; however, the vocalization was produced
more often than it was not and the facial expression was always present
regardless of whether or not the vocalization was uttered.


Terminology: Muffled growl (all savannah baboon species: Estes, 1992; 
P. cynocephalus: Hall and DeVore, 1965), copulation grunts/call (P. papio: 
Byrne, 1981; P. cynocephalus: Hall and DeVore, 1965; Semple et al., 2002).


3.4 Vocalizations produced by infants and juveniles
3.4.1 Chatterings 


Acoustic description (see Figure 14): Chattering is a series of unevenly spaced
single-phase calls, which have a chuffing-like sound, possibly
ingressive-egressive due to the production method (see below, ‘Context &
usage’). The vocalization is quite soft in amplitude and not strongly
harmonically structured (i.e., noisy). Formants and F0 were often difficult
to discern, particularly in infants, who produced much noisier calls than
older juveniles.


Figure 14: Chatterings produced by a juvenile. Audio signal (top panel), wide-band
spectrogram (Praat) showing the first two formants (middle panel), and
F0 (bottom panel) computed with our Matlab program.

Context & usage: Chattering was used during play behavior, usually while
running.


Terminology: Chattering (P. ursinus: Estes, 1992), panting (P. anubis:
Ransom, 1981; P. hamadryas: Ransom, 1981; Smuts, 1985).


3.4.2 Moans


Acoustic description (see Figure 15): Moans are a single-phase, single-call
vocalization. The call has a strong tonal structure with even formants and
a gently arching, high F0. It sounds similar to a sheep vocalization.


Figure 15: Moans produced by an infant. Audio signal (top panel), wide-band
spectrogram (Praat) showing the first two formants (middle panel),
and F0 (bottom panel) calculated with Praat.

Context & usage: This vocalization was only observed to be produced by
infants, usually in response to maternal rebuff or in distress situations. As
with yaks, moans seemed to be produced when the mother was walking,
not allowing her infant to feed or hold on. Series of moans were
observed; this seemed to be an extension of singular calls. Moans were often
accompanied by yaks, to produce the ‘ick-ooer’ sound described by Hall
and DeVore (1965). However, we noted that several yaks often preceded
the moan. Occasionally, moans were followed by a scream, but the two were
not linked in the same way as the yak-moan sound.


Terminology: Ooer (of the ick-ooer — P. cynocephalus: Hall and DeVore, 
1965), moan (P. anubis: Ransom, 1981; P. hamadryas: Ransom, 1981; 
Smuts, 1985), infant moan (P. papio: Byrne, 1981). 


4. Formant analyses 
4.1 Methods for formant analyses conducted on vocal categories 


Formant analyses were performed for several classes of vocalization,
including the grunts (two-phase grunts excluded), barks, wahoos, yaks and
copulation calls. Vocalizations of infants and juveniles, and adult screams,
were not considered for these analyses because of their high F0, sometimes
approaching 1 kHz. The method used for formant analysis is explained in
detail in the supplementary material of Boé et al. (2017). For these analyses,
the parts of the vocalizations containing formants were grouped into one
file per vocalization type (e.g., bark). The grunt file included the rhythmic,
threat and roar grunts. These different types of grunts were grouped
together because they were highly similar in their formants. The two-phase
grunts were not included in the grunt file, because they differed slightly
from the other grunts regarding their formants (see Figure 12). The bark
file grouped the alarm and contact barks, and the wa- and -hoo files likewise
grouped the alarm and contact wahoos. To limit the perturbation due to
noise, maximize the reliability of the LPC (linear predictive coding) results
and achieve the clearest possible characterization of the vocalizations,
formant analyses were performed using frames from 0.5 to 2 seconds long, so
that each frame encompassed several utterances. Successive frames operated
as a sliding window overlapping by 50%, and the results and subsequent
processing were based on the frame outputs from this LPC processing. The
frame database was then filtered to further control for detection errors:
all frames missing F1, or with F1 or F2 values more than 3 standard
deviations from the means of their categories, were eliminated from the
dataset (see below). F0 was also measured in the same frames using
autocorrelation and peak-picking. The detailed corpus characteristics and
LPC settings are provided in Table 3. Interested readers are referred to
Boé et al. (2017) for an in-depth discussion of our choice of variables,
regarding for instance the number of poles.

Table 3: Corpus characteristics and LPC settings.

                    | Grunt | Wa- | -hoo | Cop | Bark | Yak
                    | ♂♀    | ♂   | ♂    | ♀   | ♀    | ♀
CORPUS
N baboons           | 13    | 3   | 3    | 8   | 11   | 10
N vocalizations     | 522   | 69  | 69   | 124 | 116  | 504
Total duration (s)  | 65    | 11  | 15   | 10  | 29   | 69
Mean duration (ms)  | 125   | 159 | 219  | 81  | 250  | 137
LPC SETTINGS
N poles             | 60    | 30  | 60   | 60  | 30   | 60
Frame duration (s)  | 1     | 1   | 1    | 0.5 | 1    | 2
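
The framing and filtering steps of Section 4.1 can be sketched as follows. This is only an illustration in Python, not the Matlab and Praat pipeline used in the study. The function names, the dictionary layout of a frame and the formant values in the example are hypothetical, and two details are assumptions: frames that would run past the end of a file are dropped, and the category mean and (sample) standard deviation are computed over all frames in which F1 was detected. The LPC step itself is not reproduced.

```python
from statistics import mean, stdev


def sliding_frames(total_ms, frame_ms, overlap=0.5):
    """Return (start_ms, end_ms) pairs for frames overlapping by `overlap`.

    Assumption: a trailing partial frame is dropped rather than padded.
    """
    hop_ms = int(frame_ms * (1.0 - overlap))
    if frame_ms <= 0 or hop_ms <= 0:
        raise ValueError("frame length and hop must be positive")
    return [(start, start + frame_ms)
            for start in range(0, total_ms - frame_ms + 1, hop_ms)]


def filter_frames(frames, n_sd=3.0):
    """Drop frames with no F1, or with F1/F2 beyond n_sd SDs of the category mean.

    Each frame is a dict with keys "category", "f1" and "f2" (hypothetical layout);
    a missing formant is stored as None.
    """
    with_f1 = [f for f in frames if f["f1"] is not None]
    kept = []
    for category in sorted({f["category"] for f in with_f1}):
        group = [f for f in with_f1 if f["category"] == category]
        bounds = {}
        for key in ("f1", "f2"):
            values = [f[key] for f in group if f[key] is not None]
            if len(values) >= 2:  # an SD needs at least two values
                bounds[key] = (mean(values), n_sd * stdev(values))
        for f in group:
            outlier = any(
                f[key] is not None and abs(f[key] - centre) > limit
                for key, (centre, limit) in bounds.items()
            )
            if not outlier:
                kept.append(f)
    return kept


# Frame boundaries for a 65 s file cut into 1 s frames (the grunt settings of Table 3).
grunt_frames = sliding_frames(total_ms=65_000, frame_ms=1_000)
print(grunt_frames[:3])

# Synthetic formant values, for illustration only.
frames = [{"category": "grunt", "f1": 500.0, "f2": 1500.0} for _ in range(19)]
frames.append({"category": "grunt", "f1": 5000.0, "f2": 1500.0})  # implausible F1
frames.append({"category": "grunt", "f1": None, "f2": 1500.0})    # F1 not detected
print(len(frames), "frames before filtering,", len(filter_frames(frames)), "after")
```

Working in integer milliseconds keeps the frame boundaries exact; with a 50% overlap each stretch of signal, apart from the first and last half-frame, contributes to two frames.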


4.2 Results of formant analyses conducted on vocal categories. 


The acoustic results for F0 and the first two formants are reported
in Table 4 for each class of vocalizations. Table 4 reveals that the baboons
produced high- (i.e., ♂ wa-, ♀ bark, and ♀ yaks), and low vowel-like sounds
(i.e., ♂ and ♀ grunts, ♂ -hoo, ♀ copulation calls), which are characterized
by F1 formants in the high and low ranges, respectively. Table 4 also
demonstrates the production of front and back vowel-like sounds, characterized
by F2 formants in the high (♂ wa-, ♀ bark) and low ranges (♂ and ♀ grunts,
♂ -hoo, ♀ copulation call, and ♀ yak). We have no space here to present
our formant analyses in more detail. However, note that this data set
was analyzed in depth by Boé et al. (2017). Examining these vocalizations
through modeling of their maximal acoustic space based on anatomical
measures of the baboon’s vocal tract, Boé et al. (2017) demonstrated that
these vocalizations share the F1/F2 formant structure of the human [i æ a
o u] vowels. From these results, we can conclude that the baboons can
produce several vocalic qualities differentiated by their formant structures,
and that these structures are characteristic properties of vocalizations
produced in distinct social contexts, or for different functions.

5. Discussion
5.1 On the Guinea baboon’s vocal repertoire 


We observed and recorded twelve vocalizations, easily distinguishable by 
both ear and production/acoustic characteristics, produced by our group of 
baboons. Two calls, the bark and the wahoo, showed slight differences in 
acoustic features when produced in different contexts (contact and alarm). 
Interestingly, more types of vocalizations were given by adult males in our 
study than any other sex/age category (Table 1) and constituted the second 
largest proportion of recorded vocalizations despite the small sample size. 
Eleven of the vocalizations in our repertoire have been reported
previously, either for Guinea baboons or other baboon taxa, but we could
not find a clear analogue of the male grunt vocalization in the baboon
vocal literature. It is possible that Byrne (1981) referred to
this vocalization as the ‘deep grunt’, but with only a description of “long,
low pitched grunt, fluctuating in pitch and volume. Adult males only (?)”
(p. 287) it is difficult to be sure.

Seven of the vocalizations we describe are short distance communica- 
tions; that is, their production did not allow for long-distance detection. 
The baboons showed a large range of F0 production, from around 40 Hz
for grunts to up to 1 kHz for screams. Feeding time and the occasional 
presence of sheep elicited the greatest variety of calls (barks, wahoos, 
threat grunts [in response to the sheep, only], roargrunts and male grunts) 
of any major contexts recorded. In regard to feeding, due to the cap- 
tive environment, we are able to report the first known transfer of two 
vocalizations (barks and wahoos) to a new context in this species. It is 
known that baboons will use barks and wahoos to contact conspecifics 
when moving through dense vegetation (Cheney et al., 1996; Rendall et 
al., 2000), but this is the first time these calls have been reported as
a means of contacting caretakers.

We observed that some vocalizations could elicit vocal responses 
from conspecifics but found little evidence of communicative volleys 
between individuals. Some vocalizations (rhythmic grunts, screams, 
yaks, threat grunts, chattering and moans) could be directed towards 
specific baboons but they rarely elicited a vocal response. The bark or 
wahoo of one individual when observing (either visually or through 
auditory means) the approach of a human at feeding time or the sheep
would often result in the production of these calls (usually barks) from 
other baboons. However, these calls were directed at external stimuli. 
Wahoos produced by adult males at night are known to create volleys
whereby males from different groups produce wahoos back and forth
(Anderson and McGrew, 1984; Byrne, 1981). We observed this occurring
between our Guinea baboons and the olive baboons at night, but
never just between the males within our group, and certainly the alarm 
wahoos in response to sheep never elicited a vocal response from the 
olive baboons. Although screams have been considered a recruitment 
call (e.g., in infants, Rendall et al., 2009), we found no particularly 
strong evidence to support this hypothesis; only some of the screams 
from infants and juveniles resulted in a vocal response (rhythmic grunts) 
or physical approach from adults (most screams were produced during 
conflicts and may better act as appeasement). However, it is important 
to note that rhythmic grunts directed towards individuals could elicit 
rhythmic grunts. For example, adult males grunting towards infants or 
juveniles would sometimes get grunts in return as the two animals
approached each other. Hugging baboons would also often grunt. More
research is required to determine the specific cues in the initial
vocalization of one baboon that elicit the same vocal response in another
baboon, particularly when the vocalization is directed specifically to a
conspecific rather than an external stimulus.

One vocalization that is produced by all age- and sex-groups is the yak. 
The term ‘yak’ has been typically used for the adult production of the 
infant/juvenile ‘geck’ vocalization. ‘Geck’ or ‘gecker’ is a common infant 
primate vocalization (see Jacobus and Loy, 1981; Patel and Owren, 2007) 
and is usually not produced by adults within these other species. Despite 
the alternative naming, it has long been suspected that the ‘geck’ and the 
‘yak’ in baboons are equivalent. Certainly, we noted them in similar con- 
texts, although infants have additional contexts (e.g., maternal rebuff). 
Our analysis suggests that the calls are the same, with acoustic structure 
differences due to the caller (i.e., age, size, development etc.). Meanwhile, 
after infancy the moan vocalization is no longer produced and chattering 
disappears at some point during the sub-adult stage. 

In the literature, wahoos made by adult females, juveniles and even
sub-adult males are typically distinguished from those produced
by adult males, which are considered more stereotyped (e.g., Byrne, 1981;
Fischer et al., 2002). These studies suggest that in adult female wahoos
the ‘hoo’ is often missing or inaudible. We propose that these calls are
more likely to be barks. Moreover, as the ‘wa’ of the wahoo is suspected to
be ingressive (Gustison et al., 2012; also, personal observation by authors
CK, TL and YB) whereas a bark is egressive, it is unlikely that these are
the same vocalization, and we therefore suggest that they should be more
clearly differentiated in repertoires.

From a more general perspective on the baboon repertoire, detailing the
vocal ethogram of Guinea baboons is a first step towards better understanding
the differences between the baboon taxa. It is important that full ethograms,
including those from infants and juveniles, are reported for the other
species so that we can better understand how socio-ecological conditions,
morpho-physiological and behavioral differences, as well as geographical
variations, have affected vocal use in these closely related taxa.


5.2 On language evolution 


The main strength of our study is the description of the acoustic parameters
of the baboon’s vocal productions, and the description of the ethological
context in which these vocalizations were produced. In doing so, we
followed a strategy which is not so different from language studies that try
to map the acoustic features of the vocal production to meanings, as for
example when phonology distinguishes the American English words boat
(/bot/) and bat (/bæt/) exclusively through the distinction between the /o/
and /æ/ vowel phonemes they contain. This approach suggests at least three
lines of discussion regarding the evolution of language.

First, we note that the vocal repertoire of Guinea baboons is of a 
limited size (see McComb and Semple, 2005) for a species with a large 
social group size (Patzelt et al., 2011). The small repertoire of twelve 
vocalizations we report here is further constrained by the individual call 
types. That is, grunt-based vocalizations account for over half of the 
Guinea baboons’ vocal repertoire. However, it appears that the baboons 
can increase their repertoire through the use of variability. Variability in 
vocal production occurred through changes of F0, tempo (calling rate),
call duration, number of calls within a bout, and the combination of
different vocalizations (e.g., the scream-yak-moan sequences of infants,
bark-screams, double wahoos). More work is needed to identify whether
the variability we observed in vocal production conveys specific meanings.
Addressing this question would require, for instance, comparing behavioral
responses to long versus short series of yaks or grunts.
The variations we observed in the baboon’s vocalizations suggest that a 
first stage in language evolution might have been to introduce variations 
in the production and use of a limited set of vocal units, rather than 
expanding the number of different vocalizations. Considering context 
and variation within vocalizations may be essential to determine how 
nonhuman primates expand their limited repertoire to communicate with 
conspecifics, and to document the emergence of language. 

Second, the analysis of the baboons’ vocalizations has shown that sev- 
eral of them contain formants, and that these formants differ from one 
class of vocalization to the other (see Table 4). It has long been thought
that nonhuman primates are incapable of producing sets of vowel-like
sounds due to anatomical limitations (in particular, a larynx positioned
too high; Lieberman et al., 1969). The observation that baboons produce
different vocal qualities in different ethological contexts shows that
nonhuman primates can produce contrasting vowel qualities despite a high
larynx (see Fitch et al., 2016, for converging results). This finding
suggests homologies between baboons’ vocalizations and human vowel systems
and, more generally, that spoken languages could have evolved from an ancient
vocal proto-system already present in our last common ancestor with
baboons (Boé et al., 2017).

Third, Table 4 also bears on language evolution. It shows that F0 varied
greatly both across vocalizations (e.g., 64 Hz for grunt 1 and 417 Hz for
the wa- of wahoos) and even within a vocalization (417 Hz for the wa- and
121 Hz for the -hoo of the wahoos). In human languages, formants vary
independently from laryngeal frequency, and the fundamental frequency of
the baboons’ vocal production was not as stable as that found in speech.
This finding suggests that the production of F0 and of formants could have
been entangled during the early stages of language evolution. Clearer
dissociations would have emerged later in the hominid lineage.

In summary, the data presented in this paper have two main functions.
First, this work aims to serve as a reference guide for students of
baboons’ vocalizations and those interested in the communication systems of
nonhuman animals. Second, in documenting these aspects of baboons’
vocal communication, this study also provides hypotheses on the emergence
of speech. We believe that there is much to learn on these two aspects if this
approach is replicated in other nonhuman primate species.


Supplementary material 


Illustrative examples of the different vocalizations can be found at
https://osf.io/nr2ye/


Appendix 1. Glossary of terms used


Term                Meaning

Vocalization        Type of vocal production, with series of calls taken into
                    account (i.e., copulation call). That is, a vocalization
                    can either be comprised of a single call (e.g. wahoo)
                    or a bout of calls which can be temporally connected
                    (less than 1.5 s apart) to be within the same vocalization
                    (e.g. rhythmic grunts, which are never produced as a
                    single call).

Call                Individual unit within a vocalization (i.e., single grunt
                    within a series).

Calling rate        Speed of call production within the bout (number of
                    calls/s).

F0                  Fundamental frequency. Measured in Hz.

Formant F1 F2       Acoustic resonances (first and second) of the vocal
                    tract, affected by the position of the tongue, mouth
                    cavity and lips. Measured in Hz.

Maximum frequency   The highest frequency (Hz) observable in our
                    spectrograms.

Noise               Lacking harmonic structure.

Harmonics           The simple periodic waves which make up the vocal
                    signal, in which the F0 is the first harmonic and each
                    subsequent harmonic repeats at the interval of the F0.
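
The glossary definitions of a bout (calls less than 1.5 s apart) and of
calling rate (calls per second) can be made concrete with a short sketch.
It is illustrative only: the function names are hypothetical, and two points
are assumptions not settled by the glossary, namely that the gap is measured
from the end of one call to the start of the next, and that bout duration
runs from the first onset to the last offset.

```python
def group_into_bouts(calls, max_gap_s=1.5):
    """Group (onset, offset) call times, in seconds, into bouts.

    A call joins the current bout when it starts less than max_gap_s
    after the previous call ended (assumed offset-to-onset gap).
    """
    bouts = []
    for onset, offset in sorted(calls):
        if bouts and onset - bouts[-1][-1][1] < max_gap_s:
            bouts[-1].append((onset, offset))
        else:
            bouts.append([(onset, offset)])
    return bouts


def calling_rate(bout):
    """Calls per second within one bout (first onset to last offset)."""
    duration = bout[-1][1] - bout[0][0]
    return len(bout) / duration if duration > 0 else float("nan")


# Hypothetical call times: three grunts in quick succession, then a late one.
calls = [(0.0, 0.1), (0.5, 0.6), (1.0, 1.1), (4.0, 4.1)]
for bout in group_into_bouts(calls):
    print(len(bout), "call(s), rate:", calling_rate(bout))
```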


Appendix 2. List of the subjects involved in this study, their 
housing group, sex and age in months at the 
start of the study, as well as classification 


The broad age classifications used (adult: 7+ years; sub-adult: 5-7 years; 
juvenile: 2-5 years; infant: < 2 years) were based on studies conducted on 
P. cynocephalus (Altmann et al., 1981) and P. hamadryas (Sigg et al., 1982). 
* indicates that these individuals moved up an age category during this 
study (age category given is that at the start of the study). ^ indicates that 
most of the vocalizations recorded for these individuals were after the move 
to the next age category. ° indicates that these individuals were selected for 
formant analysis. Any vocalizations of these individuals recorded around 
the time frame of their transition to the next category were carefully
considered before classification for analysis, but we largely kept to the
definition of each category.


Name Group Sex Age (months) Category 
Pipo ° 1 M 156 Adult 
Vivien ° 1 M 94 Adult 
Bobo 1 M 73 Sub-adult 
Dan 1 M 53 Juvenile* 
Felipe 1 M 27 Juvenile 
Filo 1 M 22 Infant*^ 
Grimm 1 M 12 Infant 
Harlem 1 M 2 Infant 
Kali ° 1 F 204 Adult 
Brigitte ° 1 F 199 Adult 
Michelle ° 1 F 199 Adult 
Mona ° 1 F 186 Adult 
Atmosphere 1 F 174 Adult 
Petoulette ° 1 F 162 Adult 
Romy 1 F 149 Adult 
Uranie ° 1 F 104 Adult 
Violette ° 1 F 92 Adult 
Angele 1 F 88 Adult 
Arielle ° 1 F 82 Sub-adult* 
Dream 1 F 51 Juvenile* 
Dora 1 F 49 Juvenile 
Ewine 1 F 37 Juvenile 
Fana 1 F 30 Juvenile 
Feya 1 F 25 Juvenile 
Flute 1 F 24 Juvenile 
Hermine 1 F 6 Infant 
Articho 2 M 82 Sub-adult*^
Barnabe 2 M 74 Sub-adult 
Cloclo 2 M 66 Sub-adult 
Cauet 2 M 65 Sub-adult 
B06 ° 3 F 332 Adult 
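
The age classification used in this appendix can be written as a small
function of age in months. This is a sketch under one assumption: the text
gives the classes as ranges in years (adult: 7+; sub-adult: 5-7; juvenile:
2-5; infant: < 2) without saying which class owns a shared boundary, so the
lower bound is taken as inclusive here. The function name is hypothetical.

```python
def age_category(age_months):
    """Map age in months to the broad age classes of Appendix 2.

    Boundaries are in months (7, 5 and 2 years); each lower bound is
    assumed to be inclusive.
    """
    if age_months >= 84:
        return "Adult"
    if age_months >= 60:
        return "Sub-adult"
    if age_months >= 24:
        return "Juvenile"
    return "Infant"


# Ages at the start of the study, taken from the table above.
for name, months in [("Pipo", 156), ("Bobo", 73), ("Flute", 24), ("Filo", 22)]:
    print(name, months, age_category(months))
```
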


What’s up with Wahoo?
Exploring Baboon Vocalizations with 
Speech Science Techniques 


Abstract: The baboon repertoire includes around fourteen vocalizations. One
of them, the “wahoo”, is named onomatopoetically and is composed of three
sounds in two syllables, which makes it interesting because it is the most
complex baboon vocalization. We consider two hypotheses linking the baboon
call and its human name: first, is there a demonstrable acoustical similarity
with the human wahoo, and second, is there also further similarity related to
articulation? We analyze both acoustic and articulatory information to explain
why this vocalization is perceived phonetically as two separate syllables
[wa.u]. This study corroborates the hypothesis of two equivalence levels: one
acoustic-perceptual and the other related to the production mechanism in
baboons and humans. This reveals an apparent similarity between a typical
baboon vocalization and an utterance entirely typical of human languages, and
thereby adds to the links between non-human primate vocalizations and human
speech.


1. Introduction 


Baboons are African and Arabian Old World monkeys which constitute 
the genus Papio, part of the subfamily Cercopithecinae. They can be clas- 
sified into five species: hamadryas, papio, anubis, cynocephalus, and ursi- 
nus. They live in groups of 5 to 200 individuals and communicate with 
fourteen vocalizations (Hall and DeVore, 1965). It seems that except for 
Papio hamadryas, all baboons use comparable vocalizations. For half a 
century the vocalizations of baboons have been identified and associated 
with situations described ethologically (behavior and communication) (Hall 
and DeVore, 1965; Kemp et al., 2017; Zuberbühler, 2012). As with many
animal sounds, baboon vocalizations are dependent on the sex, the status, 
and the age of the individual (e.g. copulation call for females, moans for 
infants). Vocalizations are named with reference to their production mode: 
barking, grunting, roaring, screeching, moaning, and yakking. One of them,
wahoo, corresponds to a sequence of three sounds, perceived as [w], [a],
and [u], while other vocalizations are simpler. Humans perceive it as the
bisyllabic onomatopoeia “wahoo”. (The same onomatopoeia occurs with
different spellings in other languages.) Among the vocalizations
of baboons, the grunt has been the most studied acoustically. It is a voiced
call, in which the fundamental frequency (F0) can be accurately measured.
As its spectrum displays the formant structure characteristic of vowels, it 
can be called a vowel-like vocalization (Owren et al., 1997). Wahoo has 
been less studied (Cheney et al., 1995; Cheney et al., 1996; Fischer et al., 
2001; Fischer et al., 2002; Fischer et al., 2004; Kitchen et al., 2013; Maciej 
et al., 2013; Price et al., 2014). 

We aim (i) to test the accuracy of the onomatopoeia of the call’s name 
by acoustic and articulatory analysis (since the mimicry of other animal 
calls can differ greatly across languages), and (ii) to check the similarity of 
the production of the wahoo sequence as uttered by baboons and humans. 
More generally, this kind of study will help us better understand the specif- 
ics of communication by nonhuman primates. Observing such similarities 
and differences can also help in the search for any precursor elements of 
speech present in the vocalizations, and serve to infer the features of our 
common ancestral communication system (Fedurek and Slocombe, 2011; 
Fitch, 2002; Ghazanfar and Rendall, 2008; Zuberbühler, 2012). The
comparison is especially interesting because of the phylogenetic distance
between baboons and humans, with a last common ancestor estimated at about
25 to 30 million years ago. A better understanding of the analogy between
human speech and a typical sequence of vocalizations in this distant
phylogenetic cousin would add a valuable piece to the long history of studies
comparing primate vocalizations to help illuminate the evolution of human
oral language.

We conducted an analysis that led to the formulation of two hypotheses:
the first, based on acoustic and perceptual analysis, posits equivalence of
acoustic and phonetic features for baboons and humans, and explains why
we hear two separate syllables [wa.u]; the second posits, in addition, the
equivalence of production processes between baboons and humans for this
type of vocalization.


2. Data 


Wahoos (also termed wahoo calls, double barks, loud two-syllable barks) 
are mainly produced by adult male baboons in various circumstances: 
danger situations (e.g. human or predator presence), intragroup aggres- 
sion between adult males, attacks on females, or when splitting up a group 
into sub-groups out of visual contact (Byrne, 1981; Cheney et al., 1995; 
Rendall et al., 2000). The first part, {wa}, is similar to a bark, a typical 
alarm call produced by females (Fischer et al., 2001). The production of
the first syllable {wa} has been described as pulmonic ingressive (Gustison 
et al., 2012), while the egressive nature of the bark, classically considered 
as similar, has been assumed but not investigated. The pulmonic ingressive 
mode of phonation is documented in many human languages (Eklund, 
2004, 2008). However, it only occurs in paralinguistic contexts, and not 
phonemically. Furthermore, it has been little studied acoustically (Grau 
et al., 1995; Orlikoff et al., 1997). This ingressive production has also 
been noted in vocalizations of dogs, cows, horses, asses, and foxes, and 
also in purring in the domestic cat (Felis catus) and in the cheetah
(Acinonyx jubatus) (Eklund et al., 2010; Peters, 2002). The second element,
{hoo}, is more open than a grunt (i.e., a higher F1) and also has a higher
F0 (Boé et al., 2017).

We combined two types of data. The first set is a video, filmed in the 
wild and provided through YouTube (Larimer, 2012), of a chacma baboon 
P. ursinus, in which the wahoo is an indicator of male dominance (Fischer et 
al., 2002). The second set corresponds to 69 wahoos selected in the acoustic 
database of a previous study (Boé et al., 2017). Subjects were 3 adult males 
(6, 7, and 13 years) from a group of 24 Guinea baboons (P. papio) living in
semi-freedom, housed in a 750 m² outdoor enclosure at the CNRS primate
center (Rousset-sur-Arc, France) (Fagot et al., 2014). 

The acoustic database was crucial for providing F0 and spectral data on
wahoo production by baboons for comparison with human speech (Section 3).
The sample extracted from the video made possible the articulatory analysis
presented in Section 5.


3. Acoustical analysis


A short-window acoustic analysis was conducted to highlight the
spectro-temporal structure of a representative example of wahoo extracted
from the video. In addition, a medium-window spectral analysis was
conducted for the three segments {w}, {a} and {hoo} of the 69 wahoos in
the database.


Figure 1: Acoustic analyses for wahoos from the video example (above, a-c)
and from the database (below, d-f): (a) signal and F0 ({w} is initially
unvoiced, then F0 for {wa} from Praat’s autocorrelation routine and for
{hoo}, from manually tagged periods); (b) spectrogram (5-ms window);
(c) Welch spectrogram; (d-f) Welch spectrograms in grey (500-ms
window) and their averages in bold for {w}, {a}, and {hoo}. P1, P2, & P3
correspond to spectral peaks. These are manually placed in (b-c) by
analogy with (d-f).


3.1 The video example


The F0 analysis was done by autocorrelation using the Praat software
(Boersma and Weenink, 2014) for the {wa} segment and by
manually picking the fundamental periods for the {hoo} segment. The F0
plateau that makes up the majority of the {a} segment is centered around
170 Hz over a period of about 300 ms. The {hoo} part drops to around
60 Hz. The wahoo vocalization thus covers a range of about 1.5 octaves,
a larger range than in a typical wahoo produced by a human. A classic
spectrogram obtained by the Praat software with a wideband 5-ms analy- 
sis window (Figure 1b) and a second Welch 3D spectrogram established 
for comparison between the example and the database analyses (Matlab®, 
Figure 1c), with a sliding window of 72 ms, allows identification of spectral 
peaks that may be either harmonics or formants. The spectral peaks shown 
in the 3D spectrogram have a similar complex spectro-temporal structure. 
They will be labeled by relying on the average spectra calculated from the 
database (Figure 1d-f). 
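
The octave range quoted above follows from the base-2 logarithm of the ratio
between the two F0 values. The short sketch below, with a hypothetical
function name, reproduces the arithmetic for the video example (plateau near
170 Hz, {hoo} near 60 Hz).

```python
import math


def octave_span(f_high_hz, f_low_hz):
    """Interval between two frequencies, in octaves (log2 of their ratio)."""
    return math.log2(f_high_hz / f_low_hz)


# F0 plateau of {a} versus F0 of {hoo} in the video example.
print(round(octave_span(170.0, 60.0), 2))  # about 1.5 octaves
```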


3.2 Database 


Each of the 69 wahoos of the database was divided into {wa} and {hoo}
sections. The first 50 milliseconds of {wa} (corresponding to the initial F0
variation) were then labeled {w} and the remainder was assigned to the
segment {a} (see the illustration on the sound from the video example in
Figure 1a). These three segments {w}, {a}, and {hoo} were concatenated into
separate files. For each of the three files, the Welch spectrogram (Signal
Processing Toolbox, Matlab®) was generated after pre-emphasis (by
differentiation) in long windows (500 ms) representing approximately two
tokens of each segment. Finally, the overall average amplitude was determined
in order to show the mean spectral peaks. The Welch spectrogram enables a
better display of the global spectral structure. The spectra of {w}, {a}, and
{hoo} for each 500-ms window vary widely (thin grey lines in Figure 1d-f),
but the average over all windows (bold line) shows clear peaks, labeled P1,
P2, & P3. These peaks were then identified in the spectrograms (Figure 1b-c).
Values of P1, P2 and P3 are given in Table 1, below, together with F0 values
for each segment, measured with the method described above for the {w}, {a},
and {hoo} concatenated files.
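
The two processing steps named above, pre-emphasis by differentiation and
Welch averaging, can be sketched in plain Python. This is not the Matlab
routine used for the chapter: the function names are hypothetical, and the
Hann window, the 50% segment overlap, the naive DFT and the sampling rate
and segment length of the usage lines are all assumptions chosen to keep the
example short.

```python
import cmath
import math


def pre_emphasis(x):
    """First-order differentiation: y[n] = x[n] - x[n-1]."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]


def welch_spectrum(x, seg_len, overlap=0.5):
    """Average the Hann-windowed periodograms of overlapping segments.

    Returns one power value per DFT bin, from 0 to seg_len // 2.
    """
    if len(x) < seg_len:
        raise ValueError("signal shorter than one segment")
    hop = max(1, int(seg_len * (1 - overlap)))
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (seg_len - 1))
              for n in range(seg_len)]
    n_bins = seg_len // 2 + 1
    total = [0.0] * n_bins
    n_segments = 0
    for start in range(0, len(x) - seg_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(seg_len)]
        for k in range(n_bins):
            # Naive DFT: slow, but adequate for a short illustration.
            coeff = sum(seg[n] * cmath.exp(-2j * math.pi * k * n / seg_len)
                        for n in range(seg_len))
            total[k] += abs(coeff) ** 2
        n_segments += 1
    return [p / n_segments for p in total]


# Synthetic 1 kHz tone at an assumed 8 kHz sampling rate.
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * n / fs) for n in range(512)]
spectrum = welch_spectrum(pre_emphasis(tone), seg_len=64)
peak_bin = spectrum.index(max(spectrum))
print(peak_bin * fs / 64, "Hz")
```

Averaging over segments is what smooths the widely varying individual
spectra into the bold mean curves from which the peaks are read.
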

Table 1: Data for wahoos from the database of 69 utterances by 3 male baboons.
The duration of {w} is set arbitrarily to the first 50 ms of wahoo (thus,
no variance in duration). For mean duration and mean F0, standard
deviations are provided in parentheses.


label                 {w}         {a}         {hoo}
total duration (s)    3.45        8.00        14.80
mean duration (ms)    50 (-)      116 (36)    214 (89)
P1 (kHz)              0.791       0.981       0.484
P2 (kHz)              1.054       2.275       0.972
P3 (kHz)              2.239       -           2.175
mean F0 (Hz)          333 (106)   308 (102)   109 (35)


4. Acoustico-phonetic question: Why do we hear [wa.u]? 


We first wanted to understand why we perceive “wahoos” produced by 
baboons as the phonetic sequence [wa.u] (where the period represents a syl- 
lable break). Therefore, we compared the acoustic-phonetic characteristics 
of {wahoo} produced by baboons with a similar sound sequence produced 
by humans under several source conditions: (a) normal voice, (b) whispered 
voice, and (c) pulmonic ingressive voice on the first syllable, [wa] (and normal 
egressive voice on the [u]). This enables the comparison of F0 and spectral
properties. As background, recall that (human) [w] is present in 74% of the 
representative sample of 451 human languages provided in the UPSID data- 
base (Maddieson, 1986). Phonetically, [w] is a glide, a voiced labio-velar ap- 
proximant, which means that it is articulated with the back part of the tongue 
raised toward the soft palate, while rounding the lips. Acoustically (Calliope, 
1989, pp. 118-119; Ladefoged, 2006; Potter et al., 1947, pp. 202-206), 
the first formant (F1) of [w] (around 0.3 kHz for a male speaker) is always 
more intense than both F2 (around 0.65 kHz) and F3 (around 2.5 kHz). F1 
and F2 of [w] have a characteristic transition as an opening movement of 
the vocal tract (Figure 2). However, one can sometimes also perceive [wa] 
from a sequence [ua] in hiatus, i.e. when [u] is very short but without the 
rapid formant transition responsible for perception of [w] (Delattre, 1968). 

Figure 2: Spectrogram of {wahoo} (Praat, 50 ms analysis window): (a) normal
egressive speech, human male 1; (b) ingressive speech, human male 2;
(c) whispered speech (i.e. with no laryngeal source), human male 2; (d)
baboon, from the YouTube video. The four sets of peaks are manually
placed in order to show their similarity below 1.5 kHz.


Hence, we analyzed [wa.u] sequences produced by a male speaker in normal 
modal voice or with ingressive [wa] (as in Figure 2 panels a & b). In our 
recorded human examples, the wahoo F0 contour covers a range of 0.75
octaves, much less than the 1.5 octaves in baboons. To simplify the spec- 
tral comparison between humans and baboons, we suppressed the signal 
above 1.5 kHz by a low pass filter, knowing that it would not significantly 
change the perception of the [a] (Delattre et al., 1952). We observe (with 
short-term windowing of 5 ms) that in the lower part of the spectrum the 
two spectro-temporal structures are quite similar. We hand-marked the 
spectral peaks that are common to the four utterances. The first peak of 
the speaker’s [w] corresponds to the first formant F1. Along the first 50 ms,
we observe changes in human speech that correspond to formant transitions
proper, whereas for the baboon vocalization, the trajectory also involves
a spectro-temporal variation similar to a formant transition. We conjecture
that human listeners perceptually infer a formant transition from this
trajectory. The P1 spectral peak for baboon wahoo in {a} is also similar to
the first formant of the speaker’s [a], and in continuity with the P1 of {w}.
In the case of baboon {a} filtered at 1.5 kHz, the P1 peak alone is enough to
get the [a] percept, as is the case with F1 of [a] for the human voice
(Delattre, 1968). Hence the continuity between the P2 of {w} and the P2 of
{a}, which is not present for the baboon, is not necessary for the perception
of [a]. The P1 & P2 peaks for {hoo} uttered by baboons correspond to F1 and
F2 for human [u] in all cases. This allows us to establish that the acoustic
features observed in the baboon {wahoo} vocalization can adequately explain
the phonetic perception [wa.u], which is in line with our acoustic-phonetic
hypothesis.
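
The low-pass filtering at 1.5 kHz used for the spectral comparison can be
illustrated with a windowed-sinc FIR filter. The chapter does not say which
filter was used, so the design below (Hamming window, 101 taps), the function
names and the 16 kHz sampling rate of the usage lines are assumptions made
purely for illustration.

```python
import math


def lowpass_fir(cutoff_hz, fs_hz, n_taps=101):
    """Hamming-windowed sinc low-pass coefficients (n_taps should be odd)."""
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    mid = (n_taps - 1) / 2
    taps = []
    for n in range(n_taps):
        t = n - mid
        if t == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]  # normalised to unity gain at 0 Hz


def apply_fir(x, taps):
    """Direct convolution, keeping only fully overlapped output samples."""
    m = len(taps)
    return [sum(taps[k] * x[n - k] for k in range(m))
            for n in range(m - 1, len(x))]


def rms(x):
    """Root-mean-square amplitude of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))


fs = 16000  # assumed sampling rate
taps = lowpass_fir(1500.0, fs)
for freq in (500.0, 4000.0):
    tone = [math.sin(2 * math.pi * freq * n / fs) for n in range(1000)]
    print(freq, "Hz: output/input RMS =", rms(apply_fir(tone, taps)) / rms(tone))
```

A tone well below the cutoff passes almost unchanged, while one well above it
is strongly attenuated, which is the behaviour the comparison below 1.5 kHz
relies on.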


5. Articulatory question: Is production similar? 


The spectral and temporal similarities that we established in the lower range 
of the spectrum (below 1.5 kHz), did not take into account the P2 peak of 
{a} around 2.3 kHz (Table 1). To explain the correspondence of this peak 
with the second formant in [a] we have to invoke an acoustic production 
model (Boé et al., 2013). This model lets us determine the maximal acoustic 
space within which the first two formants of all vowels occur for a vocal 
tract of the given length (Boé et al., 2017). We have estimated 13.4 cm as 
the length of the vocal tract of a male baboon. Figure 3 shows (in grey) 
the maximum acoustic space (MAS), which is the area enclosing all paired 
(F1, F2) values that could theoretically be produced by such a vocal tract, 
assuming total control of its shape and configurations (see Boé et al., 2017, 
for discussion). The spectral peaks P1 and P2 of {w}, {a}, and {hoo}, pro- 
vided in Table 1 and interpreted as formants, are shown as they would be 
located inside this maximal acoustic space. 
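
As a rough point of reference for how vocal tract length sets the scale of
this space, the resonances of a uniform tube closed at one end can be
computed directly. This is the textbook quarter-wavelength approximation,
not the articulatory model of Boé et al.; the function name and the speed of
sound (350 m/s) are assumptions.

```python
def tube_resonances(length_cm, n=3, c_cm_per_s=35000.0):
    """Resonances of a uniform tube closed at one end and open at the other:
    F_k = (2k - 1) * c / (4 * L), for k = 1..n."""
    return [(2 * k - 1) * c_cm_per_s / (4 * length_cm) for k in range(1, n + 1)]


# Neutral-tube resonances for the 13.4 cm estimate of the male baboon tract.
print([round(f) for f in tube_resonances(13.4)])
```

A shorter tube shifts every resonance upward in proportion, which is why the
maximal acoustic space has to be computed for a given vocal tract length.
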

Figure 3: F1 and F2 for [w], [a], and [u], as shown in the maximum acoustic space
of a 13.4 cm vocal tract, corresponding to that of an adult male baboon.

We observed that, in agreement with our production hypothesis, {a} uttered
by baboons would correspond to an open front vowel. The {hoo} is located
where we would expect the mid-high back rounded [o]. That it does not quite
correspond to the high back rounded [u] might be because the lips are not
sufficiently closed and protruded. For {w}, the
first peak, P1, is in continuity with P1 of {a} whereas the continuity between 
P2 of {w} and P2 of {a} is absent. Though this means we do not have an 
actual trajectory for [wa.u] in F1-F2 formant space, we have indicated an 
approximate trajectory schematically by arrows (Figure 3). 


Figure 4: Distances a and b measure the displacement of the thoracic cage; 
distance c measures lip opening; distance d is from the center of the ear 
canal to the distal extremity of the upper lip and quantifies lip protrusion 
(images 6 and 15 of the video). 

Figure 5: Articulatory analysis of the sample of “wahoo” extracted from the
video. Synchronization of sound and image with spectrum (Praat, 50 ms
analysis window), lip opening, rib cage variations, and lip protrusion
(fitted with a 3rd-degree polynomial curve).
We also directly studied the wahoo production on the video clip. We manually
estimated and reported three types of measurements from the video,
image by image: (a) lip opening, (b) movement of the rib cage, and (c)
lip protrusion (Figure 4). These measurements are synchronized with the
sound, so as to allow analysis of the production sequence. From anatomical
data (Boé et al., 2017), knowing that the distance between the center of the
ear canal and the upper lip in the rest position is about 17 cm, we estimated
the conversion factor at 5 pixels/cm. The lip opening gesture takes about
180 ms (5.5 images) and includes {w} and the beginning of {a}, with lips 
plateauing until its end (image 12), when the lips close to the beginning of 
the {hoo} (image 15). Ingressive airflow is shown by an increase of thoracic 
volume synchronized with {wa} and a plateau that ends at the beginning 
of {hoo} (image 12) together with the closing of the lips. The lip protrusion
measurement is highly variable or noisy, with a trajectory fit (using
a 3rd-degree polynomial curve) which reaches its maximum when the lips
close (image 15).
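
As a rough illustration of the unit conversions used above, the following Python sketch converts pixel measurements to centimetres with the reported factor of 5 pixels/cm, and converts image counts to durations. The frame rate of 30 images per second is an assumption (it is not stated in the text, although it is consistent with 5.5 images lasting about 180 ms), and the function names are hypothetical.

```python
# Illustrative sketch only. The conversion factor comes from the text;
# the frame rate is an ASSUMPTION chosen to match "5.5 images ~ 180 ms".
PIXELS_PER_CM = 5.0   # reported conversion factor
ASSUMED_FPS = 30.0    # hypothetical video frame rate (not given in the text)


def pixels_to_cm(pixels: float) -> float:
    """Convert a distance measured on the image (pixels) to centimetres."""
    return pixels / PIXELS_PER_CM


def frames_to_ms(n_frames: float, fps: float = ASSUMED_FPS) -> float:
    """Convert a number of video images to a duration in milliseconds."""
    return n_frames * 1000.0 / fps


if __name__ == "__main__":
    # 85 pixels corresponds to the 17 cm ear-canal-to-upper-lip rest distance.
    print(pixels_to_cm(85))
    # Duration of the lip opening gesture (5.5 images) under the assumed rate.
    print(round(frames_to_ms(5.5)))
```

Under these assumptions, a rest distance of 85 pixels maps to 17 cm, and 5.5 images correspond to roughly 183 ms, close to the 180 ms reported for the lip opening gesture.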

For the three types of measurement we can therefore estimate, relative to the
rest position, a range of variation of about 1 cm for lip protrusion, 8 cm
for lip opening, and 1 cm for the rib cage movement (Figure 5). The {w}
lip opening gesture is accompanied by a rise in F0 as well as a rise in the
first formant. This effect is well-known for the [w] (Potter et al., 1947) and 
for stop consonants in general (Wang and Fillmore, 1961), and reflects the 
strong coupling between source and vocal tract. The fundamental frequency 
is correlated with the airflow at the glottis and the transition of the first 
formant corresponds to the opening of the vocal tract. The measurement 
of the protrusion supports the hypothesis of lip protrusion for {wa}. This 
protrusion also seems to involve a forward projection of all the tissue of 
the muzzle. 

Overall, except for the difficulty of showing a clear F2 transition for {w}, 
our hypothesis of similar production mechanisms for baboon vocalizations 
and human speech is supported by our observations, with the caveat of 
course that pulmonic ingression is only paralinguistic for human speech. 
6. Conclusions


There is an increasing interest from various research communities in animal 
vocalizations that are used for communication purposes. This is particularly 
true for researchers in primatology, due partly to increased interest in lan- 
guage evolution. We have here compared an understudied baboon vocaliza- 
tion, the “wahoo”, and its onomatopoetic name from human speech. We 
have used standard methods, commonplace in speech research, to analyze 
and compare certain aspects of baboon productions to similar processes 
in human speech. In particular, we used acoustic analysis of F0 and of
spectral characteristics of the baboon wahoo to understand how it is likely 
perceived by humans, and we showed that those acoustic traits indeed 
provide support to its onomatopoetic name. We also used a video of a 
baboon producing a wahoo to extract quantitative articulatory data allow- 
ing us to understand several interesting aspects of the baboon’s production 
mechanism, and we showed that many of them are quite similar to human 
speech production mechanisms. We also verified that ingressive vocaliza- 
tion, which is found paralinguistically but is unusual in human speech, is 
common in the baboon wahoo. We thus transcribe baboon wahoo as [wa↓.u↑],
with the ↓ down arrow indicating an ingressive initial syllable and the ↑ up
arrow indicating the egressive second syllable.

We believe we have demonstrated that standard phonetic and acoustic 
methods developed for speech can be profitably used for the analysis of 
vocalizations in non-human animals, and we recommend further explora- 
tory efforts in the same vein. 


Acknowledgements 


The authors are grateful to: Tom Larimer for the YouTube baboon video; 
Thierry Legou, Arnaud Rey, Caralyn Kemp, Yannick Becker for recording 
and labeling the vocalizations of Papio papio baboons; Guillaume Captier 
for the anatomical study; and Pierre Badin, Pascal Perrier, and Jean-Luc 
Schwartz for helpful discussions. This research was funded partly by ANR 
SkullSpeech grant ANR-13-TECS-0011 “e-SwallHome-Swallowing & Res- 
piration: Modelling & e-Health at Home” in the “Technologies pour la 
Santé” program. 


What’s up with Wahoo? 71 


Research supported by grants ANR-16-CONV-0002 (ILCB), ANR- 
11-LABX-0036 (BLRI) and ANR-11-IDEX-0001-02 (A*MIDEX). 
Technical support from the staff of the Rousset-sur-Arc primate center is 
acknowledged. 


7. References 


Boé, L.J., Badin, P., Ménard, L., Captier, G., Davis, B., MacNeilage, P., 
Sawallis, T.R., and Schwartz, J.L. (2013). Anatomy and control of the 
developing human vocal tract: A response to Lieberman. Journal of Pho- 
netics, 41, 379-392. 

Boé, L.J., Berthommier, F., Legou, T., Captier, G., Kemp, C., Sawallis, T.R., 
Becker, Y., Rey, A., and Fagot, J. (2017). Evidence of a vocalic proto- 
system in the baboon (Papio papio) suggests pre-hominin speech precur- 
sors. PLoS One, 12, e0169321. 

Boersma, P., and Weenink, D. (2014). Praat: doing phonetics by computer 
[Computer program]. Version 5.3.63, retrieved 24 January 2014 from 
http://www.praat.org/


Byrne, R.W. (1981). Distance vocalisations of guinea baboons (Papio papio) 
in Senegal: an analysis of function. Behaviour, 78, 283-312. 


Calliope (1989). La parole et son traitement. Paris: Masson. 


Cheney, D.L., Seyfarth, R.M., and Palombit, R. (1996). The function and 
mechanisms underlying baboon ‘contact’ barks. Animal Behaviour, 52,
507-518. 


Cheney, D.L., Seyfarth, R.M., and Silk, J.B. (1995). Responses of female 
baboons (Papio cynocephalus ursinus) to anomalous social interactions: 
evidence for causal reasoning? Journal of Comparative Psychology, 109, 
134-141. 

Delattre, P., Liberman, A.M., Cooper, F.S., and Gerstman, L.J. (1952). An 
experimental study of the acoustic determinants of vowel color; observa- 
tions on one- and two-formant vowels synthesized from spectrographic 
patterns. Word, 8, 195-210. 


Delattre, P. (1968). From acoustic cues to distinctive features. Phonetica, 
18, 198-230. 


Eklund, R. (2004). Pulmonic ingressive speech: a neglected universal? 
TMH-OPSR, KTH, 50, 21-24. 


72 L.J Boé, T. R. Sawallis, J. Fagot, F. Berthommier 


Eklund, R. (2008). Pulmonic ingressive phonation: diachronic and syn- 
chronic characteristics, distribution and function in animal and human 
sound production and in human speech. Journal of the International 
Phonetic Association, 38, 235-324. 


Eklund, R., Peters, G., Weise, F., and Munro, S. (2010). A comparative
acoustic analysis of purring in four cheetahs. FONETIK 2012, Department
of Philosophy, Linguistics and Theory of Science, University of
Gothenburg, The XXVth Swedish Phonetics Conference, 41-44.


Fagot, J., Gullstrand, J., Kemp, C., Defilles, C., and Mekaouche, M. (2014). 
Effects of freely accessible computerized test systems on the spontaneous 
behaviors and stress level of guinea baboons (Papio papio). American 
Journal of Primatology, 76, 56-64. 

Fedurek, P., and Slocombe, K.E. (2011). Primate vocal communication: a
useful tool for understanding human speech and language evolution?
Human Biology, 83, 153-173.

Fischer, J., Hammerschmidt, K., Cheney, D.L., and Seyfarth, R.M. (2001). 
Acoustic features of female chacma baboon barks. Ethology, 107, 33-54. 

Fischer, J., Hammerschmidt, K., Cheney, D. L., and Seyfarth, R.M. (2002). 
Acoustic features of baboon loud calls: influences of context, age, 
and individuality. Journal of the Acoustical Society of America, 111, 
1465-1474. 

Fischer, J., Kitchen, D., Seyfarth, R.M., and Cheney, D.L. (2004). Baboon 
loud calls advertise male quality: acoustic features and their relation to 
rank, age, and exhaustion. Behavioral Ecology and Sociobiology, 56, 
140-148. 

Fischer, J., Metz, M., Cheney, D.L., and Seyfarth, R.M. (2001). Baboon
responses to graded bark variants. Animal Behaviour, 61, 925-931.

Fitch, W.T. (2002). Comparative vocal production and the evolution of
speech: reinterpreting the descent of the larynx. In A. Wray (Ed.) The
transition to language (pp. 21-45). Oxford: Oxford University Press.

Ghazanfar, A., and Rendall, D. (2008). Evolution of human communica- 
tion. Current Biology, 18, R457-R460. 

Grau, S.M., Robb, M.P., and Cacace, A.T. (1995). Acoustic correlates of 
inspiratory phonation during infant cry. Journal of Speech and Hearing 
Research, 38, 373-381. 




Gustison, M.L., Le Roux, A., and Bergman, T.J. (2012). Derived vocali- 
zations of geladas (Theropithecus gelada) and the evolution of vocal 
complexity in primates. Philosophical Transactions of the Royal Society, 
B367, 1847-1859. 


Hall, K.R.L., and DeVore, I. (1965). Baboon social behavior. In I. DeVore 
(Ed.) Primate Behavior: Field Studies of Monkeys and Apes (pp. 53-110). 
New York: Holt, Rinehart and Winston. 

Kemp, C., Rey, A., Legou, T., Boé, L.J., Berthommier, F., Becker, Y., and
Fagot, J. (2017). Vocal repertoire of captive guinea baboons (Papio papio).
In this volume.

Kitchen, D.M., Cheney, D.L., Engh, A.L., Fischer, J., Moscovice, L.R., and 
Seyfarth, R.M. (2013). Male baboon responses to experimental manipu- 
lations of loud ‘wahoo calls’: testing an honest signal of fighting ability. 
Behavioral Ecology and Sociobiology, 67, 1825-1835. 

Ladefoged, P. (2006). A course in phonetics. Belmont: Thomson, Wads- 
worth. 

Larimer, T.S. (2012). The wa-hu shout. https://www.youtube.com/ 
watch?v=za839cpwUh0 

Maciej, P., Ndao, I., Patzelt, A., Hammerschmidt, K., and Fischer, J. (2013).
Vocal communication in a complex multi-level society: constrained 
acoustic structure and flexible call usage in guinea baboons. Frontiers 
in Zoology, 10, 58-72. 

Maddieson, I. (1986). Patterns of sounds. Cambridge: Cambridge Univer- 
sity Press. 

Orlikoff, R.F., Baken, R.J., and Kraus, D.H. (1997). Acoustic and physiologic
characteristics of inspiratory phonation. Journal of the Acoustical
Society of America, 102, 1838-1845.

Owren, M.J., Seyfarth, R.M., and Cheney, D.L. (1997). The acoustic features
of vowel-like grunt calls in chacma baboons (Papio cynocephalus
ursinus): implications for production processes and functions. Journal
of the Acoustical Society of America, 101, 2951-2963.

Peters, G. (2002). Purring and similar vocalizations in mammals. Mammal 
Review, 32, 245-271. 

Potter, R.K., Kopp, G.A., and Green, H.C. (1947). Visible speech. New
York: D. Van Nostrand Company, Inc. 




Price, T., Ndiaye, O., Hammerschmidt, K., and Fischer, J. (2014). Limited 
geographic variation in the acoustic structure of and responses to adult 
male alarm barks of African monkeys. Behavioral Ecology and Socio- 
biology, 68, 815-825. 

Rendall, D., Cheney, D.L., and Seyfarth, R.M. (2000). Proximate factors 
mediating ‘contact’ calls in adult female baboons (Papio cynocephalus
ursinus) and their infants. Journal of Comparative Psychology, 114, 
36-46. 

Wang, W.S., and Fillmore, C.J. (1961). Intrinsic cues and consonant percep- 
tion. Journal of Speech and Hearing Research, 4, 130-136. 


Zuberbühler, K. (2012). Primate communication. Nature Education Knowl- 
edge, 3, 83. 


Adriano R. Lameira¹,²
¹Department of Psychology and Neuroscience, St. Andrews University,
St. Andrews, Scotland, UK
²Department of Anthropology, Durham University, Durham, UK


Origins of Human Consonants and Vowels: 
Articulatory Continuities with Great Apes 


Abstract: Science has only a faint idea of what proto-speech sounded like and how
it was composed. The probability of an accurate reconstruction of speech and 
language evolution is, however, vanishingly small without such information. The 
original nature, form, and function of the building blocks of proto-speech directly 
determined which communicative operations and linguistic computations were 
possible. Knowledge about the sounds that composed the first syllables and words 
can offer, thus, insight into the chains of events that launched language evolution. 
Primate bioacoustics in our closest relatives — nonhuman great apes — represents a 
rich source of information on the probable composition of the ancestral great ape 
call repertoire that predated human speech. Here, I illustrate how the long-term 
inventory of the wild and captive call repertoire in orang-utans and African apes 
has unveiled a deep articulatory homology between great ape voiceless and voiced
calls on the one hand (i.e. without and with vocal fold action as sound source,
respectively) and human consonants and vowels on the other. This articulatory
parallel offers a clearer view over the basic sounds that composed the “mother 
tongue” of all the world’s spoken languages. The presence of proto-vowels and 
proto-consonants in the last common hominid ancestor spawns new questions 
regarding the steps that made up the process of speech and language evolution 
and their relative timing. 


Keywords: proto-speech, orang-utans, African apes, call repertoire 


1. Introduction 


Given how fundamental language is to what defines us as humans, we know
surprisingly little about its origin. Historically, the birth of language was
regarded as something remarkable. Since the first civilizations and across cultures
over millennia, speech was seen as a trait granted by divine forces that separated
humans from other animals (e.g. “In the beginning was the Word and
the Word was with God and the Word was God”, John 1:1). This theme
is, for example, covered in manuscripts from the Indus Valley, the oldest
philosophical essays that survive to the present, dated to the ninth century
BCE and predating Classical antiquity by hundreds of years (Favareau,
2010). An interesting reflection of these concepts in folk culture can be
found in fables and myths, where talking animals possess a human mind and
mental capacities (e.g. the Big Bad Wolf of Little Red Riding Hood was
Machiavellian, capable of deception and disguise) and other uniquely human
behaviours, such as standing upright and wearing clothing. On the other
hand, human-like creatures that lacked speech preserved their monstrosity
(e.g. the Cyclops of Greek mythology).

In this chapter, I will lay out recent lines of evidence for the presence
of evolutionary raw material for the emergence of human consonants and
vowels among nonhuman great apes (hereafter great apes), our closest
living relatives, with a special focus on orang-utans (Pongo spp.)¹.
Historically, these data represent a tipping point in the theory of language
origins: in the way we view and think about human evolution, the emergence
of our most diagnostic behaviour, and the origins of the most advanced
communication system known in the natural world. This is an exciting
period to be in the field of language evolution, where both new and
senior generations of researchers can overcome centuries-old mythical
notions and pose, test, and advance new scientific hypotheses and, in this
manner, make significant contributions to solving one of the oldest puzzles
in human thought.


The plum tree and the plum seed 


One of the dominant approaches to date to the study of language evolution
has been the identification of the basic cognitive operations and
computational mechanisms that underpin language use today. This approach
has sought to identify what is truly and uniquely human and
makes language possible. The fundamental capacity for compositional
and combinatorial syntax has emerged as the key feature presumed to
have transformed us into the talking ape (Hauser et al., 2002). From this
position, scientists have searched for similarities within the primate order,
but with little success (Fitch and Hauser, 2004), and some of its proponents
have declared this approach to be a heuristic dead end (Hauser
et al., 2014).


1 This chapter explores and focuses on possible parallels in articulation and acoustics,
and remains agnostic on interpretations of call function and meaning (new
efforts on this front have been made elsewhere (Schlenker et al., 2016)). A separate
approach to these aspects of language evolution, i.e. articulation vs. meaning, has
been defended in the Frame/Content Theory by MacNeilage (1998).

Because human language today is so sophisticated, the weakness in this 
traditional approach can be understood with the aid of the following image. 
Think of a mature plum tree, or the highest sequoia. It is nearly impos- 
sible to conceive how such a tree could have arisen from its minute seed, 
particularly so without a priori knowledge about plant development. When 
we observe the tree, we see fruit, flowers, and photosynthesis. Yet, none of 
these features can be found in the seed. Surely, then, the tree did not
come from its seed! 

A new approach to language evolution, drawing on comparative primate
research, seeks to study the development of the seed of language, starting
from the seed and following its development forward in time, instead
of starting from the fully grown tree and working backwards through its
development. In this manner, we might be able to identify the conditions
under which the seed germinates and takes the first roots, which structures 
capable of photosynthesis emerge first, and which shapes the first leaves 
take on. With this information, we might just be able to reconstruct how 
the language kernels that lay dormant within great apes may have become, 
over evolutionary time, the tall tree of human language. 


What constitutes a probable precursor of speech? 


Speech is fundamentally learned. Just as cells are replaced and
renewed without the loss of the tissues and organs they make up, so too
are languages. They are culturally renewed one generation after the other,
through vocal learning by the young cohorts of their speakers. Each child,
as a new member of a linguistic community, needs to receive acoustic input
and feedback and will, through these means, acquire the sounds
that constitute her mother tongue.

Accordingly, it is helpful to assume that a particular primate call behaviour
represents a conceivable precursor of a speech sound when, at least,
that behaviour is the result of vocal learning. Learned calls contrast with
other calls that are in essence genetically inherited, often termed innate.
Although this dichotomy is not always clear-cut (Fitch, 2010), it is heuristically
valuable as an entry point for screening primate call repertoires for
potential language precursors. Innate calls emerge inevitably in normally
developing individuals, without the need for relevant auditory input. Conversely,
the acquisition of a new call into one’s repertoire through
vocal learning is underpinned and primarily driven by auditory input (Owren
et al., 2011). This input is mandatory: without it, no acquisition of new
calls can occur. Vocal learning can, then, manifest in two ways: via the
capacity to acquire new calls (expanding in this way one’s vocal repertoire)
or via the ability to modify a call previously acquired (Pisanski et al., 2016).

The more we retreat in time along primate phylogenetic branches, the 
larger the uncertainty about how influential an ancient behaviour was in 
the process of language evolution. A higher level of attention and study 
should centre, thus, on vocal learning in great apes. Due to their close 
relatedness to our own species, their vocal behaviour and capacities will 
likely provide us with pertinent and plausible hypotheses for language 
evolution and a comparative platform to perhaps further explore older 
primate precursors. Among great apes, orang-utans have emerged as a 
particularly surprising species. Below, I review the most recent findings 
in this genus and how they relate to the latest studies involving African
apes. 


2. Are orang-utans a suitable model species for studying 
language evolution? 


In the early 2000s, the study of the call repertoire of wild orang-utans
was started in earnest in the swamps of the central province of Indonesian
Borneo by a team of Dutch, Swiss, and Portuguese researchers (Hardus
et al., 2009). Until then, all information that was available derived from
pioneering work done in the 1960s and 1970s (Mackinnon, 1974; Rijksen,
1978). These references sometimes lacked spectrograms, so re-assessment
had to be done strictly on the basis of written descriptions, which proved
particularly challenging.
The reason why the first modern comprehensive description of the
orang-utan call repertoire took decades longer to be published than, for
example, the landing of the first man on the moon, was in part due to
a known relationship between primate sociality and vocal complexity
(McComb and Semple, 2005). According to this relationship, the larger
the typical group size of a primate species, the larger and richer its call
repertoire tends to be. It followed from this correlation that orang-utans,
being the only diurnal primate to exhibit solitary tendencies (Delgado
and van Schaik, 2000), were predicted to produce a very small range of
vocalizations and sounds. As became apparent during the work in the
swamplands of Borneo, this correlation fell short in explaining the high
diversity and richness of the orang-utan call repertoire.

As the earliest diverging hominid genus, orang-utans were often argued
to offer close to no insight into the question of language
evolution. Some scholars contended this was because orang-utans show
the lowest level of genetic similarity with humans. Interestingly enough,
this is in fact inaccurate, specifically with regard to the genetic
mutations that presumably played a critical role in language evolution:
mutations associated with the gene encoding the forkhead box protein
P2 (FOXP2) (Enard et al., 2002).

Humans differ from African apes by two amino acid substitutions in
this gene, but from orang-utans by three. That is, orang-utans do not
share one amino acid that is common to all African apes: they exhibit an
extra amino acid substitution that is unique to Pongo. Because the FOXP2 gene
is extremely conserved in mammals (Enard, 2011), the last great ape
common ancestor in all likelihood exhibited an African-like FOXP2
genetic profile, similar to that found in chimpanzees, for instance. This
gene went on to undergo subsequent amino acid substitutions only in
the Homo and Pongo lineages, making orang-utans at least as good a model
species as any African ape. If the number of mutations in FOXP2 can
be assumed to reflect stronger selective pressures for vocal evolution,
orang-utans then stand as the most promising model species among great
apes for the study of language evolution.
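
The arithmetic behind this argument can be made explicit with a small Python sketch. It takes the number of FOXP2 amino acid substitutions on each lineage since the last great ape common ancestor, as described above (two in Homo, one in Pongo, none in African apes), and derives the pairwise differences. It assumes that substitutions on different lineages affect different sites, so that differences simply add up; the variable and function names are hypothetical.

```python
from itertools import combinations

# Substitutions per lineage since the last great ape common ancestor,
# as described in the text (the ancestor is taken to be African-like).
LINEAGE_SUBSTITUTIONS = {"Homo": 2, "Pongo": 1, "African apes": 0}


def pairwise_differences(subs: dict) -> dict:
    """Return the number of amino acid differences for each pair of lineages.

    ASSUMPTION: substitutions on separate lineages hit different sites,
    so the difference between two lineages is the sum of their counts.
    """
    return {(a, b): subs[a] + subs[b] for a, b in combinations(subs, 2)}


if __name__ == "__main__":
    for pair, n in pairwise_differences(LINEAGE_SUBSTITUTIONS).items():
        print(pair, n)
```

Under this assumption the sketch gives two differences between humans and African apes, three between humans and orang-utans, and one between orang-utans and African apes.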
One further reason why orang-utans make excellent model species for
the study of language evolution derives directly from their vocal behav- 
iour. It is in this domain that most progress has been made regarding 
our understanding of the sounds and structure of the human ancestral 
language. 


3. Vocal differences between the Asian great ape and its 
African cousins (gorillas, bonobos, and chimpanzees) 


3.1 Dialects 


During the process of cataloguing the call repertoire of wild orang-utans,
a remarkable feature of their vocal communication system emerged.
Different populations produced different calls in the same context, whereas
other populations used no call whatsoever in those same circumstances
(Krützen et al., 2011; van Schaik et al., 2003; Wich et al., 2012) (Fig. 1).
This pattern is identical to what is classified as dialects in humans (Lameira
et al., 2010), where synonymous words are used alternatively in different
locations of the same language, for instance “trousers” and
“pants” in British and American English, respectively. In orang-utans,
these are, for instance, the calls that mothers produce
to call their infants during travelling and the calls that adult individuals
produce during nest construction each day. As shown in Figure 1, while
orang-utans produced raspberries (made by blowing air through
pursed lips) during nest construction in Suaq and Sabangau, two
far-distant research sites across the Karimata Strait between Sumatra
and Borneo, nest smacks (seemingly produced in a similar way to a
click consonant, with the tongue quickly shooting away from the palate)
were produced instead up river from Sabangau, in Tuanan, a population
of the same subspecies likewise living in peat swamp. At the
same time, orang-utan mothers produced site-specific throat scrapes
towards their infants in Tuanan, and harmonic uuh’s in Ketambe. Other
populations on the same island and of the same (sub)species produced no call
in the same context.
Figure 1: First letter code refers to the kind of nesting call (R = ‘Raspberry’; S = ‘Nest
Smack’; — = no call). Second letter code refers to the mother-infant call
(U = ‘Harmonic uuh’; T = ‘Throat scrape’; — = no call). (Reproduced from
Wich et al., 2012).


The geographic distribution of these “present1/present2/absent” patterns,
or “synonym dialects”, in orang-utans (where and which calls are present
or absent) was not sufficiently explained by genetic or ecological
divergence (Krützen et al., 2011). That is, levels of genetic and ecological
dissimilarity between populations and sites did not correlate with the level
of dissimilarity between the call repertoires of those populations (Wich et
al., 2012).
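
To make the notion of dissimilarity between call repertoires concrete, here is a minimal Python sketch. It is not the method of the cited studies: the metric (the share of contexts in which two populations use different call variants) is an assumption chosen for illustration, the names are hypothetical, and the repertoires contain only the site and call combinations stated in the text above, with unknown entries left out.

```python
from typing import Optional

# Call variant per context, restricted to what the text states explicitly.
# Contexts not described for a site are simply omitted (unknown).
REPERTOIRES = {
    "Suaq": {"nesting": "raspberry"},
    "Sabangau": {"nesting": "raspberry"},
    "Tuanan": {"nesting": "nest smack", "mother-infant": "throat scrape"},
    "Ketambe": {"mother-infant": "harmonic uuh"},
}


def repertoire_dissimilarity(a: dict, b: dict) -> Optional[float]:
    """Share of contexts known for both sites in which the variants differ.

    Returns None when the two sites have no documented context in common.
    ASSUMPTION: this simple mismatch proportion is illustrative only.
    """
    shared = set(a) & set(b)
    if not shared:
        return None
    differing = sum(1 for context in shared if a[context] != b[context])
    return differing / len(shared)


if __name__ == "__main__":
    print(repertoire_dissimilarity(REPERTOIRES["Suaq"], REPERTOIRES["Sabangau"]))
    print(repertoire_dissimilarity(REPERTOIRES["Suaq"], REPERTOIRES["Tuanan"]))
```

A matrix of such values across all pairs of sites could then be compared with matrices of genetic or ecological distance; the absence of a correlation between them is the pattern reported above.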

In fact, differences of this nature would be nearly impossible to explain
theoretically in terms of genetics or ecology. There are no known
genetic mutations that turn on, turn off, or replace specific calls strictly in
one context while leaving the remainder of the call repertoire
unmodified. Likewise, orang-utan mother calls
and orang-utan nest calls exhibit no ecological requirements. An orang-utan
mother usually calls for her infant when the latter falls behind while
travelling across the forest canopy, regardless of the species of tree they are
crossing at that particular moment. Equally, orang-utans build nests across
a huge range of different tree species within a single site. To explain the
production of different mother calls, or nest calls, or no call at all between
sites on the basis of forest species composition is, therefore, very problematic.

Instead, the number of dialect calls (together with other cultural behaviours) 
exhibited by a particular population was best predicted by the percentage of 
time that individuals spent together with each other in social association (van 
Schaik et al., 2003). Because more time spent together meant a higher number 
of opportunities for learning between individuals, this led to an increase in 
the cultural repertoire that those individuals could assemble. These results 
indicate that calls (when present) were indeed very likely learned. They were,
thus, probably maintained as local traditions through vocal learning between 
individuals of the same “linguistic” community (van Schaik et al., 2003). 

Synonym dialects are still to be described among African apes, but many
descriptions exist of a more fundamental type of dialect: “simple” presence/absence
patterns of calls across populations. In Pan, this pattern has been observed
between different captive populations (Hopkins and Savage-Rumbaugh, 1991; 
Hopkins et al., 2007; Marshall et al., 1999) and wild ones (Watts, 2015). A 
recent study has reported it in Gorilla between wild populations (Robbins et 
al., 2016), and some isolated cases are also known in captivity (Lameira et al., 
2014; Perlman and Clark, 2015). Altogether, this evidence reflects the capac- 
ity of all great ape genera to generate and maintain local, population-specific 
vocal traditions (Robbins et al., 2016; van Schaik et al., 2003; Whiten et al., 
1999). Accordingly, dialects seem to exhibit more complex patterns in orang- 
utans than in African apes and synonym dialects have only been described in 
wild orang-utans, thus far (Wich et al., 2012). However, a renewed interest 
and continued examination of the African repertoire will almost certainly 
reveal variation that has remained hitherto overlooked. 


3.2 Accents 


Data currently available suggest that chimpanzees and bonobos (gorilla
data are much needed) exhibit, instead, a richer variation in terms of
what is characterized as accents in humans. Accents describe differences in
the “pronunciation” of the same call, such as the difference between how
British English and American English speakers pronounce “tomato”. This
level of variation in African apes seems richer than what has been described
so far in orang-utans, even though comparing call use and
possible function between species is highly challenging, not least because the
large majority of calls are not shared between species (Clay et al., 2011;
Fedurek et al., 2015; Slocombe and Zuberbühler, 2007).

In chimpanzees, this level of variation has been shown to indicate that
vocal learning is involved in call production (Watson et al., 2015). This
was demonstrated in exemplary fashion by a unique study that managed to record
chimpanzee food calls before and after a group housed in the Netherlands
moved to Edinburgh Zoo (Watson et al., 2015). Before the two groups merged,
the food call variant that each group produced for “apples” was
acoustically different. However, once the Dutch group
moved to Scotland, their “apple” call gradually shifted over a period of
some months to converge on, and become acoustically indistinguishable from,
the variant produced by the resident group (Watson et al., 2015).
Accents can thus characteristically manifest in the form of call convergence or
divergence between groups, and this has also been demonstrated to occur
between individuals in wild chimpanzees (Mitani and Gros-Louis, 1998).

Accent differences tend to be more difficult to detect by ear than
dialects. While dialects are composed of distinct calls that facilitate their
identification, differences in accent occur within one single call type and are
therefore subtler. Experience (in the form of many hours of observation)
helps a great deal in the audible detection of accent differences, and acoustic
analyses are typically required to verify the occurrence of accents (much as
hours of experience with a novel language start to
allow us to understand where word and sentence boundaries lie).

I recall a conference where a colleague, who is an expert in chimpanzee
calls, gave a presentation in which two different chimpanzee calls were played
to the audience. The presentation slides showed that the two calls differed
in acoustics and in the way chimpanzees used either call, but I was
baffled at how I was nearly incapable of detecting these differences in sound
by ear. For someone who has made a scientific career of describing variation
and differences between call types in orang-utans, this was admittedly
embarrassing! This was a clear example of how there may
exist a quantifiable difference in the degree of gradedness between
orang-utans and their African cousins.

This called my attention to the fact that the vocal features exhibited by 
each species must be interpreted within the context of their natural call be- 
haviour and the nature of their repertoire. Chimpanzees and bonobos, for 
example, exhibit a graded call repertoire where acoustic transitions between 
different calls are fine, sometimes elusive, but imbued with straightforward and 
powerful differences in function. Receivers will react very differently to call
variants that will sound much the same to an inexperienced human observer. 
Conversely, orang-utans exhibit a more categorical call repertoire. This means 
that functional differences between calls predominantly involve the use of clear-cut
acoustic differences that unambiguously demarcate two different call types.

In this light, a richer dialect or accent variation does not necessarily
mean that one species is a “better” model, or more advanced, than the
other. If a species like orang-utans produces a repertoire typically exhibit- 
ing categorical differences between calls, then it can only be expected that 
orang-utans exhibit a rich variation in dialects because dialects involve 
differences between calls. If a species like chimpanzees or bonobos produces 
a repertoire typically exhibiting graded differences between calls, then it 
should be expected that chimpanzees or bonobos exhibit a rich variation 
in accents because accents involve differences within calls. Future research 
will be needed to accurately quantify these seeming differences between 
graded and categorical repertoires in great apes (Wadewitz et al., 2015). 
Nevertheless, the data coming in, thus far, suggest that interpreting any 
feature or trait yet to be found will need to be done with caution. Common 
capacities and skills (e.g. vocal learning) may in fact manifest differently 
across species, depending on the features of the repertoire of those species. 

For now, great apes prove to display remarkable features at the intersection
of local traditions and vocal behaviour, which is exactly where
potential language precursors are to be found. To understand how vocal 
evolution in great apes took shape and how language evolution ensued, we 
need an ever more comprehensive, inclusive, and integrated framework that 
will hopefully include a growing amount of audio recordings and behav- 
iour data from all living great apes. This endeavour is bound to bring new 
insights into the evolution of language in our lineage. 
4. Discovery of two major call categories in orang-utans


One of the most challenging tasks during the inventory of the orang-utan 
call repertoire (besides waking up at 4 am and spending the day knee-deep
in a swamp for 10 months!) was the fact that we were commonly faced 
with calls that exhibited features not described in the primate literature. 
In an age when we stand on the shoulders of giants, the lack of previous 
data dwarfed us. 

Besides voiced calls, typically termed “vocalizations” and characteristi- 
cally produced by all primates and terrestrial mammals (Taylor and Reby, 
2010), the repertoire of orang-utans was proving to be particularly rich 
in noisy smacks, clicks, kiss sounds, and raspberries. These calls did not 
involve individuals’ voice (which is the result of the regular oscillation of 
the vocal folds). Instead, call production resulted from the action of the 
supralaryngeal articulators — the lips, tongue, and jaw — and airflow gener- 
ated by their own manoeuvres or by the action of abdominal musculature 
(e.g. diaphragm). Laid out in a spectrogram (i.e. a graphic means to visualize
sound), these unvoiced calls typically exhibit traits distinct from those of their
voiced counterparts. Their distinct articulatory nature inescapably generates 
distinct acoustics. 

At the time, we were unsure of the extent to which these two different
means of orang-utan vocal production (voiced and voiceless call
production) could be meaningful.


4.1 Speech building blocks 


This binary aspect of the orang-utan vocal system might become clear when 
we take a closer look into the world’s languages. Each and every human 
language is inherently, and by definition, composed of vowels and consonants,
without exception. Acoustically and articulatorily, these two building
blocks of speech are not equivalent. Vowels are characteristically voiced, 
using the activation of the vocal folds and their regular oscillation, whilst 
voiceless utterances in humans are characteristically consonants. 

In fact, voiceless consonants dominate over other types of consonants 
in a large sample of human languages — 64% of the plosives, 72% of 
the fricatives and 74% of the affricates are unvoiced (Vallée et al., 2002). 
Moreover, voiceless consonants are found universally across the world’s 
languages, while other types of consonant may or may not be present in
particular languages (Lameira et al., 2014). In addition, evidence indicates 
that the original language before the exodus out of Africa, between 140,000 
and 60,000 years ago, was particularly rich in voiceless consonants that 
characterize several African languages today (Atkinson, 2011; Knight et 
al., 2003). Even though many modern day consonants engage the voice, 
and voiced consonants exhibit a rich variation across the world’s languages
(see for example, www.internationalphoneticassociation.org), the wide, 
rich, and time-deep presence of voiceless consonants in humans supports 
the view that great ape voiceless calls could be used as a desirable model to 
study the production and use of consonant-like calls in ancestral hominids 
during the first stages of language evolution. 

There is, thus, an apparent parallel between the composition of the 
orang-utan vocal system and human language with regards to the articula- 
tory commands and acoustic output underlying the two elementary particles 
of both systems. However, establishing an evolutionary link between the 
two requires establishing that great ape consonant-like and vowel-like calls 
are the result of vocal learning, as it occurs in humans. 


4.2 First pillar: Consonant precursors 


In orang-utans, voiceless calls predominate in synonym dialects
(Wich et al., 2012). In other words, when orang-utans produce population- 
specific calls as part of synonym dialects, these tend to be voiceless calls. 
This observation hints that voiceless calls in great apes are the result of 
vocal learning, and therefore, that they could allow establishing a link with 
human consonants. Empirical tests in captivity, which offer the benefit of
controlled settings, proved to be essential in testing this possibility.

The scientific discovery of the first whistling orang-utan provided the 
ideal conditions for a deeper investigation (Wich et al., 2009). Bonnie, a 
captive orang-utan, had been known for many years among her caretakers to
whistle like a human. We were, however, perplexed when we
caught wind of the news since this directly suggested that vocal learning was 
operating in captivity, allowing individuals to enlarge their repertoire with 
new calls as we had observed in the wild. Bonnie protruded her lips and, 
with gentle blows of air through the space in between, produced whistles. 
The likelihood that Bonnie had acquired whistling through vocal learning
was very high because whistles are very particular calls. Whistles are
melodic and tonal as voiced calls commonly are, but they do not involve 
the voice (as anyone who can whistle knows) nor do they have formants 
(frequency resonance bands) that are characteristic features in voiced calls. 
They are the result of the airstream’s periodic vortex shedding at the lip
opening and generally exploit a very narrow frequency band where most of 
the acoustic energy is concentrated. They qualify as a rather unique type of 
voiceless call. No call with these features is known to exist in the primate 
order, with the obvious exception of human whistles, of course. Accord- 
ingly, Bonnie had very likely learned this call from humans. 

The premise for our sound tests with Bonnie was as follows:
if Bonnie had indeed learned how to whistle through vocal learning,
then she should be able to exert sufficient control over whistle production 
to alter some of its main acoustic parameters. In order to non-invasively 
prompt Bonnie to produce whistles, we presented her with a “do-as-I-do” 
paradigm. In this test setting, a human demonstrator produces model calls, 
implicitly requesting the subject (in this case, Bonnie) to produce back the 
same type of call. Great apes do particularly well at this imitation game 
and promptly understand what is asked of them. Through these means,
Bonnie produced single whistles in response to single human whistles, dou- 
ble whistles in response to double whistles, short to short, and long to long. 
These results proved that Bonnie aptly controlled whistle production with 
enough accuracy to match simple human whistles. 

After Bonnie, we have come to know of at least ten captive
orang-utans who have learned how to whistle, including some who
learned from other whistling orang-utans (Lameira et al., 2013) (see video
clip here: https://youtu.be/FMuLKoKILBw). Audio and video recordings of
these individuals have shown that each individual uses its own lip style to
produce whistles, producing for instance whistles with in- and out-airflow.
Subsequent tests with some of these orang-utans have shown they too can 
match in-out and triple whistles produced by a human under the “do- 
as-I-do” settings. Altogether, these data confirm that vocal learning is a 
widespread phenomenon in orang-utans, and that individuals can acquire 
a very fine level of motor control over their lips’ positioning and movement, 
as well as over the muscles involved in creating airflow through the vocal
tract, including the diaphragm and other abdominal musculature. 

With this new insight in our minds we started to prospect the reper- 
toire of the other great apes in search of voiceless calls. If orang-utans 
were learning voiceless calls at much higher rates than ever believed to 
be possible in captivity, and if these calls had indeed an evolutionary link 
with human consonants, then African great apes ought to be expected 
to exhibit some of the same flexibility too. Some studies at the time had 
already confirmed the production of voiceless calls by captive chimpan- 
zees (Hopkins et al., 2007). New examples continue now to emerge in 
chimpanzees in the wild (Watts, 2015), as well as in wild gorillas (Rob- 
bins et al., 2016) and some cases reported in captive gorillas (Lameira et 
al., 2014; Perlman and Clark, 2015). All these studies maintained that 
some level of vocal learning was necessary to explain the production and
use of voiceless calls that were present in some individual(s)/population(s) but
absent elsewhere.

In fact, chimpanzee research has brought a new line of evidence support- 
ing the view that great ape voiceless calls are vocally learned. Exciting new 
advances in the field of neuroscience have confirmed that chimpanzees who 
have learned to produce voiceless calls (Taglialatela et al., 2012) exhibit 
different neural networks in their brain from those individuals who have 
not learned how to produce voiceless calls (Bianchi et al., 2016). Critically, 
vocal learning individuals exhibit increased grey matter in the ventrolat- 
eral prefrontal and dorsal premotor cortices — constituent regions of the 
equivalent area to Broca’s in the human brain. These regions with observed 
reorganization are responsible for orofacial motor control, demonstrating 
that these individuals required practice and development of enhanced con- 
trol to produce voiceless calls. 

Given the possible articulatory and acoustic homology with human 
voiceless consonants and accumulated evidence showing that great ape 
voiceless calls are learned, these calls can therefore be sensibly advanced as
putative precursors of human speech sounds. Specifically, I suggest that great ape
voiceless calls can be advanced as putative precursors of human consonants.
4.3 Second pillar: Vowel precursors


Once the evolutionary link between great ape voiceless calls and human 
consonants is suggested, the second link stands out conspicuously: primate 
voiced calls have probably given rise to human vowels. The articulatory
and acoustic parallel between the two is not new (Owren et al., 1997). In- 
deed, voiced calls are characteristic of virtually all mammals. In primates, 
however, they seem pervasively innate. Establishing this second evolution- 
ary link between primate voiced calls and human vowels requires, thus, 
some evidence for an active role of vocal learning. Vowel precursors should
be the result of vocal learning and maintained across peers and generations 
through cultural mechanisms. Could it be instead, however, that motor 
control over vocal fold action is so difficult that learned voiced calls in great 
apes are simply absent? 

We did not need to search far to confirm that great apes can learn voiced 
calls! It was during our efforts to register all known whistling orang-utans
that we came to know Tilda. She is a wild-born orang-utan now well into her
forties. There are no known records of her arrival in Europe. Like any great
ape smuggled into Europe in the pet trade, she was probably brought in as 
a baby and the first information we found only related to her adolescence 
onwards, when she was acquired by a private collector. As far back as 
we could verify, Tilda was known to whistle. There we were then, in the 
Cologne Zoo, Germany, where she lives now (painting and selling canvases
to raise funding for her family members still living in the forests of South- 
east Asia!). Audio recorders ready, cameras rolling, we were all set. We 
confirmed her whistling capacity but then we were shown something that 
surpassed anything we had seen. 


We were left speechless. 

Tilda babbled to us (see video clip here: https://youtu.be/ab59zcsV35k)! 

As in the case of Bonnie and all other whistling orangs, her caretakers
had known for years that she could do this and that this was part of how she
attracted human attention to request food (Lameira et al., 2015). In a
display that we had never observed in the wild or in captivity, Tilda moved
her lips and jaw at a very fast pace, producing vocalizations and sounds
that could easily have been attributed to a Disney movie character if we
had been standing in front of the TV instead of in an amazing great ape facility.
Based on the video recordings collected, we analysed the pace at which 
Tilda moved her mouth (Lameira et al., 2015). Results showed that both 
calls that she produced in this manner showed a rhythm of about five calls
per second, with one call being voiced and the other voiceless. This fast pace is
equivalent to that at which you and I produce vowels and consonants
during normal speech, which translates into five vowels and five consonants
per second.
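
The rate estimate described above can be illustrated with a short calculation. The following is a minimal sketch in Python, assuming that the onset of each call has been time-stamped from the video recordings; the function name and the onset values are hypothetical and are not data from Lameira et al. (2015).

```python
def calls_per_second(onsets_s):
    """Mean delivery rate (calls per second) from call onset times in seconds.

    The rate is the number of inter-onset intervals divided by the time
    spanned by the first and last onset.
    """
    if len(onsets_s) < 2:
        raise ValueError("at least two onsets are needed to estimate a rate")
    if any(b <= a for a, b in zip(onsets_s, onsets_s[1:])):
        raise ValueError("onsets must be strictly increasing")
    return (len(onsets_s) - 1) / (onsets_s[-1] - onsets_s[0])


# Hypothetical onsets spaced 0.2 s apart (illustrative values only).
hypothetical_onsets = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
rate = calls_per_second(hypothetical_onsets)
print(f"{rate:.1f} calls per second")
```

With onsets spaced 0.2 seconds apart, the sketch returns a rate of five calls per second, the order of magnitude reported for Tilda.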

These findings are theoretically important because they verify that the 
acoustic and articulatory parallel proposed between voiced and voiceless
calls on the one hand, and vowels and consonants on the other, makes
empirical sense. Notably, these data provided a new level of similarity
between these elements, namely their articulatory rhythm. The
two great ape call categories can be produced at the same delivery rhythm 
as the two human speech elementary blocks. In comparison with the known 
orang-utan repertoire, the observed rhythm in Tilda’s calls was seven-fold 
faster than any call described before (Lameira et al., 2015). Hence, the most 
likely and evident explanation is that Tilda acquired these calls through 
vocal learning from humans. Besides a new rhythmic parallel with human 
utterances, here we had the first indication in captive orang-utans that vo- 
cal learning of voiced calls — requiring activation of the vocal folds — was 
within reach. This built on preliminary data from the wild, where one of
the existing synonym dialects involved a voiced call (Wich et al., 2012). 

As in the case of orang-utan whistling, this evidence required experi- 
mental confirmation and we wanted to double-check our premise. If orang- 
utans in captivity were learning new voiced vowel-like calls, then they 
ought to be able to control vocal fold action to an extent where they could 
alter key voice parameters of the calls and match human demonstrations 
in real-time. However, we wanted to respect Tilda’s life history, which very 
likely involved dubious relations and experiences with humans, and so we
refrained from running tests with Tilda for ethical reasons.

However, we knew Rocky! Rocky is an orang-utan teenager who, by the
age of four, had already been in photo-shoots with the famous pop music
band Black Eyed Peas and made Levi’s TV commercials. Rocky and his 
mom (who is a known whistler) retired from the entertainment business 
and were received at the former Great Ape Trust of Iowa. Today they live 
in one of the most advanced great ape facilities in the world at the Indianapolis
Zoo. We had known Rocky since his arrival at the Trust. By then,
he already produced a very distinctive call that we had also never heard in
the wild or in captivity. We named these calls “wookies” because they repeatedly
reminded us of the orang-utan-looking Star Wars character, Chewbacca.
Wookies provided a perfect opportunity to address the question that was 
confronting us. Rocky was young, active, and eager to engage with humans 
and we knew he would take full advantage of any opportunity to show 
off his talents. 

The first step in our research plan was to confidently confirm that wook- 
ies were indeed unique to Rocky. We wanted to make sure that wook- 
ies were not an otherwise common, but misclassified orang-utan call. We 
scanned our database for the closest known call produced by orang-utans 
in the wild. We then acoustically compared this call with wookies. Results 
indicated that their acoustics and underlying articulation were distinct and
that they constituted two different call types (Lameira et al., 2016). Wookies 
were, thus, unique to Rocky and they represented a new call hitherto not 
described in orang-utans. 

The second step in our test agenda was to present Rocky with the “do- 
as-I-do” imitation game, as we had done with whistling orang-utans. In this 
context, a human demonstrator presented Rocky with high and low pitch 
approximations of wookies as an implicit request for him to reply with 
wookies of similar acoustics (Fig. 2). To produce high and low frequency 
(Hz) wookies, Rocky was required to exert real-time and fine control over 
his voice. Specifically, Rocky had to be able to control the tension of the 
laryngeal muscles associated with his vocal folds and adjust their oscilla- 
tion frequency to that of the human demonstrations. Results confirmed 
our suspicions — Rocky had vocally learned to produce these new calls. 
Whenever the human demonstrator raised or lowered the pitch of her
voice, so did Rocky within centiseconds (see video clip here:
https://youtu.be/Lg50_1RScOE). The voice frequency of the human demonstrator and of
Rocky were significantly and positively correlated, and the high and
low pitch wookies that Rocky produced were dramatically different from 
each other, as well as from wookies that Rocky produced spontaneously 
(that is, not in reply to human demonstrations). Indeed, we can see in Figure 2
that the correlation is highly significant, with a Spearman R above
0.5, meaning that the human voice explained the majority of the variation
in Rocky’s responses and that no other factor could have exhibited a higher
explanatory power. This meant that the human demonstrations were in fact
guiding Rocky’s voice, namely away from the typical voice range he used
during wookie production, and not the other way around.
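
For readers unfamiliar with the statistic, a Spearman correlation is a Pearson correlation computed on the ranks of the paired values. The following is a minimal sketch in Python under that definition; the function names are illustrative and the frequency values are hypothetical, not the measurements reported in Lameira et al. (2016).

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical maximum frequencies in Hz (illustrative values only).
human_hz = [310, 620, 450, 700, 280, 540]
rocky_hz = [350, 560, 600, 640, 300, 480]
rho = spearman(human_hz, rocky_hz)
print(f"Spearman coefficient: {rho:.2f}")
```

Because the coefficient depends only on rank order, it captures whether Rocky’s calls rose and fell with the demonstrations without assuming that the relationship is linear.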


Fig. 2: Maximum frequency of human wookie demonstrations against maximum
frequency of Rocky’s match wookies (linear trend line with intercept 
suppressed). (Reproduced from Lameira et al., 2016). 
Even if one wonders whether Rocky could have pulled off a “clever Hans” trick
(i.e. anticipating by means of slight cues what the human demonstrator
was about to do), this would still not affect our fundamental conclusion
that Rocky voluntarily controls his voice. This is because, in such a supposed
scenario, Rocky would be responding to cues or signals other than
the human voice, but one would still be left with explaining why Rocky’s
high and low calls were significantly different from spontaneous ones that
were not produced in response to human demonstrations. There would be
further issues with such a hypothetical interpretation. One would need to
explain why, instead of a direct voice-voice match, Rocky was employing
a multimodal match, using a cue or signal in another sensory channel but,
nevertheless, responding acoustically.
With this evidence, we can now cement the cornerstone of the second
pillar of the new theoretical edifice of language evolution. Together with
the first pillar, enough is now known about the call repertoire of great
apes for new evolutionary links and research avenues to be drawn. Human
vowels and unvoiced consonants are to a certain extent homologous in 
terms of articulation, acoustics, and rhythm to voiced and voiceless calls in 
great apes. Claims that no evolutionary seeds for human spoken language 
can be found within our closest phylogenetic branch are precipitate and 
uninformed (and typically made by scholars who have never studied great 
apes!). Such past claims were based on absence of evidence, not evidence 
of absence. As we have seen so far, a new generation of great ape studies is 
starting to flow in and fill what was more of a deep gap in our knowledge 
about great ape vocal behaviour, than an evolutionary gap between human 
spoken language and great ape vocal systems. Now that this new evidence 
is emerging and amassing, we cannot afford to continue ignoring great ape
vocal capacities and turn our backs on the new frontiers they unlock if we
are to crack the puzzle of language evolution.


5. Evolutionary trajectories 


As with a plum seed and a full-grown plum tree, it is important to
retain the notion that the proto-stages of language need not have exhibited
the same features as full-blown speech does today. Consonants
today exhibit varied forms beyond those that are voiceless, and many consonants
engage vocal fold action (as in the voiced plosive /b/ or the nasal
plosive /m/), as vowels characteristically do. For this reason, linguists rather
delineate the difference between modern day consonants vs. vowels as a 
measure of the constriction of the vocal tract required for production, with 
consonants being produced with relatively closed vocal tracts, and vowels 
with relatively more open configurations. Linguists do not rely, therefore, 
on a definition based on voiced-ness, as great ape data suggest may have 
been the case regarding proto-consonants vs. proto-vowels. 

Today, consonants and vowels are also produced at a fast pace in intri- 
cate and swift alternation to compose words and sentences. Their transition 
is fluid. This rhythmic aspect of speech has also been suggested as an evolutionary
forerunner of speech, in the milestone theory known as “Frame-Content”
(MacNeilage, 1998). Fundamentally, this theory suggests that the
continuous mouth open-close alternation characteristic of speech derives 
from ancient mammal behaviour, such as suckling and chewing, which 
then took on communicative functions. Due to the predominant role that 
articulation plays in this theory, it has been one of the few that has so far 
proposed equal and parallel importance to consonants and vowels in the 
process of language evolution (Lameira, 2014; Lameira et al., 2014, 2016). 
Namely, it has recognized that consonants and vowels are respectively associated
with the closed and open phases of the mouth cycle. Linguists
who are proponents of this theory commonly do not rely, therefore, on a definition
of consonants and vowels where they occur separated and unchained.
Linguistic evidence and comparative data on great ape vocal behaviour
need not be inconsistent with one another, however, nor do great ape
data challenge the Frame-Content Theory. Let us see why this is the case.
Importantly, one must appreciate the point in the timeline of language 
evolution at which linguists work and that at which great ape vocal researchers
do. Linguistics, through the reconstruction of language trees, can
reach back up to 50 thousand (50,000) years (Gray and Atkinson, 2003).
Great ape researchers, using our closest relatives as living models of ancient 
hominids, work within a frame up to 10 million (10,000,000) years ago — 
the time when our last great ape common ancestor lived. These timeframes
differ by a factor of 200, more than two orders of magnitude. If each year is
converted to one second, linguists work back to about 14 hours before the present,
whereas great ape researchers work at a point almost 4 months before the
present. It is futile and heuristically unproductive
to argue that these two points in language evolution are not
connected simply because they are not fully aligned. Indeed, no one would 
expect 9,950,000 years of evolution to go by without change or advance. In
other words, seeds should not be expected to be capable of photosynthesis!
This is exactly what the reconstruction of language evolution is all about
and what great ape researchers and linguists are expected to do: better understand
the path that bridges our last great ape common ancestor and us.
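
The arithmetic behind this analogy, in which one year is replaced by one second, can be checked with a few lines of Python. This is a minimal sketch; the variable names and the use of an average month of 365.25/12 days are assumptions made for illustration.

```python
import math

SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
DAYS_PER_MONTH = 365.25 / 12  # average month length (assumption of this sketch)

linguistics_years = 50_000       # reach of language-tree reconstruction
great_ape_years = 10_000_000     # last great ape common ancestor

# Analogy: every year of evolution is replaced by one second.
linguistics_hours = linguistics_years / SECONDS_PER_HOUR
great_ape_days = great_ape_years / SECONDS_PER_DAY
great_ape_months = great_ape_days / DAYS_PER_MONTH

ratio = great_ape_years / linguistics_years
orders_of_magnitude = math.log10(ratio)
gap_years = great_ape_years - linguistics_years

print(f"Linguistics: about {linguistics_hours:.1f} hours ago")
print(f"Great ape research: about {great_ape_months:.1f} months ago")
print(f"Ratio: {ratio:.0f} ({orders_of_magnitude:.1f} orders of magnitude)")
print(f"Gap between the two points: {gap_years:,} years")
```

The two timeframes come out at roughly 14 hours and just under 4 months, separated by the 9,950,000 years mentioned in the text.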
A possible parsimonious hypothesis that integrates these two lines of 
work could be as follows. Initially, proto-consonants and proto-vowels 
were produced and used either separately (as observed in all great apes 
today) or in relatively simple syllable-like call combinations (as observed 
in wild orang-utans today). The overall acoustic range of proto-consonants 
and proto-vowels was most probably much more limited than present day
forms. With increasing selective drive for effective and efficient (social) 
communication, the use of these two proto-elements recruited an ancient 
mammal behaviour — fast paced mouth oscillations as argued in the Frame- 
Content Theory (MacNeilage, 1998) — to increase their production rate 
(Ghazanfar et al., 2012, 2013). Orang-utan vocal data suggest this recruitment
may have involved a seven-fold increase in the delivery rate of successive
consonants and vowels (Lameira et al., 2015). Once strung together
by fast open-close mouth alternations, the acoustic features of (voiceless) 
consonants and (voiced) vowels started to fuse. Voiced-ness lost, then, its 
signature role in dividing consonants and vowels. As a consequence, what 
was a stark division in terms of voiced-ness, became a graded one based 
on degrees of vocal tract openness. 

While tentative for now, some support for this possible scenario is found 
in everyday discourse, notably in paralinguistic elements of human vocal
communication. Examples of these sounds are, for instance, “Shhh” to 
demand silence at the start of a movie, “Mmmmm” to approve mother’s 
cooking, “Ahhhhh” when we finally get the solution for a difficult quiz. 
Articulatorily and acoustically, they correspond to proto-consonant and 
proto-vowel sounds, and unlike typical consonants and vowels, they are
not strung together to form a word or a sentence. Paralinguistic sounds
seem to represent, therefore, relicts of former stages in language evolu- 
tion when proto-consonants and proto-vowels were used separately, as 
observed in great apes. One must still explain the occurrence of voiced 
consonants as paralinguistic sounds, such as “Mmmmm.” According to 
the timeline mentioned above, these sounds should have only emerged once 
proto-consonants and proto-vowels were strung together. How could
voiced consonants, then, be used separately today? The key word here 
is today. Modern humans have developed a tremendous degree of vocal 
control (Gokhman et al., 2017), demonstrated excellently by any opera 
singer or beat-boxer and the cultural variation in speech sounds across the 
world’s languages. We can, today, deploy many different types of sounds 
communicatively in many different forms. No hominid ancestor is expected 
to have exerted such a degree of vocal control.

This example illustrates that the endeavour of reconstructing language 
evolution will be one of “likely” scenarios and parsimony, not one of absolutism.
Researchers and everyone interested in this fascinating topic will
have to remain open to the new data coming in, both from linguistics
and comparative great ape vocal research, regularly check their axioms
and predictions, and foremost, exercise time-perspective taking across the
9,950,000-year gap with our hominid great (ape) grandparents.


6. Concluding remarks 


This chapter proposes a new heuristic framework for language evolution. It 
explains how consonant-like and vowel-like calls are present in great apes 
and proposes that these calls can serve as models to study ancient proto- 
consonants and proto-vowels that existed in a human ancestor. With this 
proposal in place, we can start investigating for the first time questions 
that have remained, thus far, unformulated. For instance, why did proto-consonants
and proto-vowels come together to compose the first syllable or
word (Lameira et al., 2017)? When did this occur? Were there particular
ecological conditions that made these combinations more likely to occur?
The latest evidence reviewed in this chapter stands as proof that a renewed 
interest in great ape behaviour will yield important clues as we progress
in reconstructing language evolution. 
Comparative Anatomy of the Baboon and
Human Vocal Tracts: Renewal of Methods, 
Data, and Hypotheses 


Abstract: This chapter focuses on the emergence of speech during human evolution, 
revisiting exaptation hypotheses (Fitch, 2010; MacNeilage, 1998) with new data from 
comparison with baboons. Speech necessarily evolved to be compatible with aero- 
digestive anatomy, reusing its functions of suction, chewing and swallowing. The tongue 
is involved with every feeding gesture, and also has a central position for speech. We 
analyze the evolution of the tongue position taking into account the distinction between 
the morphogenetic fields of HOX and non-HOX genes involved in the development of 
the pharyngeal arches and the cephalic structures, anatomical and neurological compo- 
nents, and functional support for breathing and swallowing. The hyoid bone is the locus 
of insertion of the tongue muscles as well as a precise marker of the glottis position. It 
is not fixed because it partly depends on the development of the facial area controlled 
by non-HOX genes. In contrast, the vertebral column has stable dimensions because it 
is controlled by HOX genes. After a detailed presentation of a baboon head dissection, 
we present a new method for mapping hyoid bone position relative to the vertebral 
axis, applied to MRI images. This is compared to a set of radiographs of human 
children aged 3 to 7.5 years. We observe that the hyoid bone is one vertebra lower in human children 
than in adult baboons. The normalized oral cavity length is shorter, in agreement with 
prognathism reduction as controlled by non-HOX genes. Using the cervical vertebrae 
and their axis as a reference allows the conclusion that there is indeed laryngeal descent 
from baboons to humans and that it is accompanied by compensatory facial shortening. 
This preserves the vocal tract length as well as the relationship between the tongue and 
the oropharyngeal cavity, which is important for swallowing and other feeding gestures. 


Keywords: exaptation, baboons, vocal tract anatomy, HOX genes, laryngeal descent 


1. Introduction 
1.1 Why link speech emergence and primate vocalizations? 


The existence of speech as a characteristic of the human species raises a 
series of questions that, for the most part, have remained open and unan- 
swered for several centuries. What are the anatomical and cognitive pre- 
requisites for vocal communication? When, where, and how did this type 
of communication arise? By what steps has this evolution taken place? Did 
gestural communication originate earlier? Or did gestures and vocalizations 
arise simultaneously? 

Researchers have at their disposal human fossils which, though rarely 
complete, do allow us, to some extent, to trace the anatomical evolution of 
the head and neck, and thus the architecture of the vocal tract. Obviously, 
there are no recordings of their sound productions. 

Already by the second third of the 19th century, Youatt (1835) had 
trouble understanding why chimpanzees lacked the power to speak 
while they were able to shout loudly. We can understand why the anatomy 
of the vocal organs of chimpanzees has since aroused great 
interest (Vrolik, 1841), but what explains why these primates, with very 
similar organs, cannot use them in the same way humans do? More generally, 
for insights into the evolution of the cerebral cortex and cognition 
in human ancestors, researchers have long studied the comparative 
anatomy of the chimpanzee brain (Clark et al., 1936; Falk, 2014; Walker 
and Fulton, 1936). 

Since we share common ancestors with both apes and monkeys, we 
hypothesize that the current vocalizations of these primates provide us 
with an underexploited window for exploring the nature of speech, and 
can inform us about the stages of its emergence. Indeed, we assume that 
the system of speech communication was gradually established over the 
course of the millions of years of evolution that separate us from our 
common ancestors. Animal communication has evolved on several levels: 
anatomical, cognitive, ethological, all under the constraining influence of 
the environment. 

On the other hand, the other descendants of these common ancestors 
would not have followed the same evolution. We can therefore assume that 
their vocalizations have changed little. The vocalizations of present-day 
monkeys would thus be relics (Pisanski et al., 2016) of earlier vocal tract 
abilities and, we could say metaphorically, fossil traces of the communica- 
tion of our common ancestors. 

Monkey and ape vocalizations depend on the sex, status, and age of the 
analyzed individual. Among primates, baboons produce a repertoire of 
fourteen identified vocalizations, each associated with an ethologically 
described situation (including behavior and communication) (Hall and DeVore, 
1965; Zuberbühler, 2012). There are several acoustic analyses of baboon 
vocalizations (e.g. Andrew, 1976; Fischer et al., 2002; Owren et al., 1997; 
Rendall et al., 2005) and more recently it has been shown that they can 
produce five differentiated vocalizations corresponding to five different 
ethological situations (Boé et al., 2017). 


1.2 The exaptation hypothesis 


This chapter does not focus on the acoustic analysis of baboon vocalizations 
but rather on the anatomical aspects that enable and condition the 
production of these vocalizations, that is to say on the anatomy of the 
larynx and on the vocal tract and its position with respect to the larynx 
and to the cervical vertebrae. 

Indeed, the vocalizations of mammals, and thus of human and non- 
human primates, are all produced by the same process. The sound generated 
by the vibration of the vocal folds (the source) is acoustically filtered by the 
resonance characteristics of the vocal tract (the filter), which extends from 
the glottis (the gap between the vocal folds) to the lips which radiate the 
filtered signal: this is the source-filter theory (Fant, 1960). By controlling 
the action of the vocal folds, by modifying the vocal tract shape through 
control of the articulators (tongue, mandible, lips), or by engaging the nasal 
passages (through lowering of the velum), it is thus possible for humans to 
articulate sufficiently differentiated vowels and consonants and to combine 
them in syllables and syllable sequences. 
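
As a rough illustration of the source-filter idea, the sketch below treats the vocal tract as a uniform tube closed at the glottis and open at the lips. This idealization is an assumption introduced here for illustration and is not the model used in this chapter. Under it, the resonances fall at F_n = (2n - 1)c / 4L, where L is the tube length and c the speed of sound. The function name, the default value of 350 m/s, and the 17.5 cm example length are all hypothetical choices.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Assumption: the vocal tract is idealized as a lossless uniform tube,
    so the n-th resonance is (2n - 1) * c / (4 * L).
    """
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_formants + 1)
    ]


# Illustrative length only (not a measurement from this chapter).
print(uniform_tube_formants(0.175))
```

With these illustrative values the first three resonances come out close to 500, 1500 and 2500 Hz. Shortening the tube raises all of them proportionally, which is one reason vocal tract length matters in acoustic modeling.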

The question, then, is whether anatomical reasons explain why primates 
would not be able to produce differentiated vocalizations. The 
question arises all the more so since for almost 50 years a widespread 
and longstanding theory (Lieberman et al., 1969; Lieberman, 1975, 1984, 
1998, 2007, 2015) has claimed that nonhuman primates, including pre- 
modern hominids, were incapable of producing systems of vowel-like 
sounds involving control of their vocal tract, due to their high larynx 
position and the resulting articulatory anatomy. 

The comparative study of the anatomy of the upper aero-digestive 
tract of Papio papio and Papio anubis baboons and humans reveals 
similarities and differences. The first difference is the transition to the 
upright posture, which caused the centering of the foramen magnum, 
and triggered reductions in prognathism and the weight of the face. 
There is therefore a modification of the aerodigestive crossroads at the 
level of the epiglottis which ends up in a lower position and which is no 
longer in contact with the soft palate in humans. The second difference 
is the less flexed skull base, which has the biomechanical consequence 
of modifying the position of the hyoid bone (Reidenberg and Laitman, 
1991). (Note that skull base flexion is measured as the angle of the 
orbital plane with that of the foramen magnum; the increased flexion 
in humans indicates a ventral displacement of the foramen magnum to 
accommodate upright posture.) 

Hence, the observed human-baboon similarities are constrained by the 
primary functions, but the critical point is to consider these similarities as 
the true prerequisites to the production of speech. In contrast, many authors 
assumed that the differences are the markers of limits on the speech produc- 
tion ability. In this context, we will describe the evolution of the laryngeal 
elements (vocal folds, cartilages, thyro-hyoid membrane), the hyoid bone 
and the oral elements (tongue, palate). We will use a common reference 
frame for vertebrates based on the axis of the cervical vertebral column, 
with the apex of the odontoid as its origin. This landmark is under the 
control of the HOX development genes, as are the hyo-branchial apparatus 
and the larynx (Figure 1). 



Figure 1: HOX and non-HOX zones delimited by anatomical landmarks on 
the skeleton. Insets: top, Drosophila, which is an important model for 
understanding body generation, and below, mouse and human embryos, 
which present HOX genes. 



After the first 15 to 20 days following fertilization in vertebrates (Couly 
et al., 1993; Couly and Bennaceur, 1998; Couly et al., 2002), the HOX 
genes are responsible for embryo development and determine its anterior- 
posterior and dorsoventral organization, and thus the placement of the 
base of the skull, the head, and the body. Consequently, these genes are in- 
volved in the growth of relevant bones, which form the framework in which 
the vocal tract is situated. This system is highly conserved in vertebrates 
(McGinnis et al., 1984) and it can be assumed that this regulation maintains 
a suitable morphology for swallowing and protecting the airways. 
Conversely, located in non-HOX areas, the oral part of the vocal tract 
is considered variable. This oral part, derived from the first pharyngeal 
arch, is under the control of a variety of genes and must be HOX-negative 
for normal development (Chai and Maxson, 2006; Kuratani et al., 1997). 
Speech would have evolved through possibilities and constraints external 
to speech: “speech from nonspeech” (MacNeilage and Davis, 2000a, 2000b), 
hence the interest in finely analyzing the anatomical structures of the vocal 
apparatus of non-hominin primates, because they are likely to enlighten us 
regarding the path followed during the emergence of speech. Thus, gestures 
of the tongue, the mandible, and the lips were compared across feeding and 
speech production (Green and Wang, 2003; Hiiemae, 2000; Hiiemae et al., 
2002; Hiiemae and Palmer, 2003; Serrurier et al., 2012). Part of the control 
might also have been exapted (for discussion, see Ballard et al., 2003; Bunton, 
2008; Folkins et al., 1995; Martin, 1991; Ziegler, 2003). 

The vocal tract’s original and still primary function is digestive. It is 
divided into two main parts that evolved with their own constraints and 
their own regulatory genes. The anterior part is dedicated to feeding, with 
suction and chewing as well as swallowing, and the posterior part is mainly 
related to swallowing. This chapter revisits the hypothesis of exaptation 
(Gould and Vrba, 1982) of speech from tongue anatomy as well as from 
these functions in several ways. First, speech gestures may be derived from 
feeding gestures. For example, suction and lip rounding are related. Second, 
they can reuse the existing anatomy. For example, the ability of the tongue 
for swallowing, which guides food from the anterior to the posterior, is 
related to its musculature. For speech, this permits constrictions inside 
the vocal tract at well-controlled positions. Third, the skill at chewing a 
variety of foods has an impact on the agility of the tongue, as well as on the 
development of oral somatosensory perception and feedback necessary for 
speech. We now continue with an anatomical description of these anterior 
and posterior components of the vocal tract, followed by a quantitative 
analysis of their evolution from baboon to human. 


1.3 The central position of the hyoid bone 


The functional requirements of vocalization involve mobilization of the air 
source, used for breathing, and of the vocal folds, which protect the airway during 
swallowing. Unlike the larynges of chimpanzees and other mammals (Harrison, 
1995; Kelemen, 1969), and unlike general anatomy (Swindler and Wood, 
1973), the baboon hyoid bone and larynx seem, with a few exceptions 
(Nishimura, 2003a, 2003b, 2005, 2006), to have been relatively little studied. 

The functional imperatives of swallowing and breathing require a close ana- 
tomical relation between the oral cavity, the base of the tongue, the pharynx, 
and the larynx. Presumably, “spatial constraints related to deglutition impose 
greater restrictions on the rate and degree of hyo-laryngeal descent than do ad- 
aptations for vocalization” (Lieberman et al., 2001). The oral and pharyngeal 
phases in swallowing allow passage of a food bolus from the oral cavity to the 
esophagus while protecting the airways. The position of these anatomical struc- 
tures is determined by their insertions on the skeleton, especially on the base 
of the skull, the mandible, and the hyoid bone. The positions will be identified 
relative to the cervical spine (mainly C2, C3, and C4), which has been shown 
to be highly similar between baboons and humans (Tominaga et al., 1995). 
Comparing baboons with humans reveals major morphological differ- 
ences, in the less flexed base of the skull and in the face, that are associated 
with the arrangement of the muscle insertions and in the position of the hyoid 
bone in the baboon. It has been established that prognathism involves differ- 
ences of the insertions of the muscles of the tongue and supra-hyoid muscles. 


Figure 2: Functional anatomy of the components involved in swallowing, 
vocalization, and articulation for speech. The two main composite 
structures, the vocal tract (left) and the larynx (right), are at the top, with 
their specific anatomical components immediately below, with explicit ties 
to their functions at the bottom. This shows the great overlap between 
structures for speech and swallowing. Note that the hyoid bone is a key to 
this overlap because it is the insertion of the tongue root at the same time as 
its anatomical relationship with the epiglottis is important for swallowing. 


The hyoid bone plays an important role in the functional anatomy of swal- 
lowing, vocalization, and speech production, since it supports the base of 
the tongue (Figure 2). It is an isolated and fragile bone, of which only a 
few fossilized specimens have been found: a complete Neanderthal hyoid 
(Kebara 2, 60 kya), a partial Neanderthal hyoid (Asturias, Spain, 43 kya), 
two hyoids assigned to Homo heidelbergensis (Sierra de Atapuerca, Spain, 
530 kya), and a complete “chimpanzee-like” hyoid assigned to Austra- 
lopithecus afarensis (Dikika, Ethiopia, 3.3 mya). These discoveries have 
renewed the interest in having such fossils for the debate on the origin of 
speech (Alemseged et al., 2006; Arensburg et al., 1989, 1990; D’Anastasio 
et al., 2013; Martinez et al., 2008; Rodriguez et al., 2003). 

Indeed, the position of the hyoid bone relative to the cervical vertebrae 
has varied during phylogeny, and it also varies during ontogeny (Lieber- 
man et al., 2001), both in humans and in non-human primates (Nishimura, 
2006). The position of the hyoid is an important indicator to consider, 
perhaps more than that of the larynx itself, because it anchors the tongue 
root. Its position was considered as a marker of the speech production 
ability, according to the laryngeal descent hypothesis (Fitch, 2010, p. 312). 


2. Descriptive anatomy of the baboon vocal tract 


An accurate comparison of the anatomy of the larynx and of the tongue 
musculature of baboons and humans is crucial for the discussion of the 
origin and evolution of speech, considering the crucial role played by these 
organs in speech. However, such comparative anatomy has been insuffi- 
ciently described, notably less than for humans vs. chimpanzees (Hofer et 
al., 1990; Swindler and Wood, 1973; Takemoto, 2008). 

The present description was based on two adult Papio papio heads, from 
one male and one female who died naturally in the UPS CNRS Primatology 
Station, Rousset, France, where various monkeys, including baboons, are 
kept. The two baboon heads were scanned at the Montpellier CHU in bone 
window (General Electric, slice thickness 0.5 mm) when fresh, then sectioned in 
the strict median sagittal plane when frozen at the anatomy laboratory in 
Montpellier. The heads were thawed in 10% formalin for the dissection, 
which was conducted with binocular loupes on both Papio papio specimens. 


2.1 General description of the vocal tract and larynx 


The anatomical relations of the pharynx and the larynx are shown on a 3D 
reconstruction of a male Papio papio skull (Figure 3) incorporating a 3D 
reconstruction of the hyoid bone and the larynx (Figure 4). All 3D reconstruc- 
tions are performed with Myrian® software. Additionally, a 3D reconstruc- 
tion of the airways of the female Papio papio is superimposed on a sagittal 
section (Figure 5), and a zoomed portion of the larynx is compared to the 
median sagittal section (Figure 6). Details are provided in the figure captions. 

The anatomy of the oral part of the baboon vocal tract is observed using 
a median sagittal section of the female head (Figure 7). It is essentially 
equivalent to that of the human in its basic elements, but not in its proportions, 
as the tongue is longer and lower, the hard palate is flatter and the 
velum (soft palate) more horizontal. In humans, the genioglossus muscle 
(GG) has three groups of fibers (GGa, GGm, GGp) that form a fan-shaped 
structure with a vertical anterior group, GGa (Testut, 1897). In the baboon, the 
GG has the same structure with GGa composed of vertically oriented fibers 
and few or no fibers towards the tip of the tongue. The anterior portion 
of the tongue is proportionally larger, in concordance with the progna- 
thism. The insertion on the superior mental spine is done by means of an 
aponeurosis on which the muscular fibers are fixed. The geniohyoid muscle, 
which is not a tongue muscle per se, is highly developed and accounts for 
more than half the tongue height. The styloglossus and digastric muscles 
were found at dissection to have a more horizontal orientation than those 
of humans. The hyoglossus presents, as in humans, two components, one 
inserting on the body of the hyoid bone and the other along the full length 
of the great horn. 


Figure 3: 3D reconstruction (Myrian® software) of a male Papio papio skull. Left, 
the skeletal profile shows the entire mandible with a strong ramus but a 
minimally pronounced coronoid process and mandibular notch. Right, the 
posterior part of the mandible has been digitally removed after segmentation 
to show the hyoid bone, the larynx and the trachea. The lowest projection 
of the hyoid bone occurs above the lower edge of the mandible. The body 
of the hyoid bone is located forward and above the thyroid cartilage with a 
portion that descends forward. The hyoid’s greater horns follow the upper 
border of the thyroid cartilage towards the thyroid’s upper horns. 


Figure 4: Comparing structure across species. Left panel, 3D reconstruction (Myrian® 
software) of the hyoid bone and the larynx of the male Papio papio. In this 3⁄4 view, 
we find that morphology of the hyoid bone is grossly similar to that of the human with 
a median body, two greater horns and two lesser horns. The body of the hyoid bone 
has an inferior enlargement situated in front of the thyroid cartilage. The view of the 
larynx reveals a thyroid cartilage that is longer than it is high. It articulates with the 
cricoid cartilage and one can infer the position of the arytenoid cartilages. Right panel, 
in comparison to humans (right), the baboon larynx (left) shows that the thyroid 
cartilage is inset into a relatively larger hyoid bone, with the body enlarged inferiorly 
and total loss of the thyrohyoid membrane. This disposition is not found in apes 
such as the chimpanzee and gibbon, but it is found in the stump-tailed macaque, Macaca 
arctoides, and the white-fronted capuchin, Cebus albifrons (Nishimura, 2003a, 2003b, 2005, 
2006), shown in the inset (respectively left and right). 


Figure 5: Sagittal section and 3D reconstruction (Myrian® software) of the airways of 
the female Papio papio. The airways were detected semi-automatically. The outlining 
of the upper airways and of the air in the oral cavity allows close study of their 
morphology. From behind the hyoid body and in front of the thyroid cartilage, the air 
sac and its conduit (circled) communicate with the larynx through a passage between the 
two cartilaginous plates of the thyroid cartilage. We see that the skull base is flattened: 
the angle shown is greater for all monkeys and apes than it is for humans. The axes of 
the hard palate and of the cervical vertebrae, both used in this study, are also displayed. 


Figure 6: Laryngeal details of Papio papio. Left: The anatomical median sagittal 
section of the larynx for a female Papio papio shows the hyoid bone (Hy) with the 
body anterior to the thyroid cartilage (Th). The air sac (AS) is between the two. 
The vocal folds (VF) are positioned at the middle of the thyroid cartilage’s vertical 
dimension. The baboons’ vocal folds measure 16.5 mm for the male and 11 mm for 
the female, in the same range as those of adult humans (Roers et al., 2009). The cricoid 
cartilage (Cr) is also found with its posterior part relatively high, as is the case for the 
epiglottis (E). The supraglottal portion of the larynx is very short. Right: Enlargement 
of the 3D reconstruction (Myrian® software) of the airway of the female Papio papio 
showing the connection from the air sac (AS) to the larynx, with the larynx divided into 
supraglottic (SG) and infraglottic (IG) portions. Connection is at the level of the glottis, 
and presents small laryngeal sacs (LS) laterally. The impression of the tongue (To) 
determines the oro-pharyngeal space that communicates with the piriform recess (PR). 


Figure 7: Anatomical median sagittal section of a female Papio papio. The soft palate 
or velum (V) is at rest and disengaged from the pharyngeal wall (Ph), with the epiglottis 
(E) in contact with the uvula at the level of the atlas (C1). The anterior (GGa), middle 
(GGm) and posterior (GGp) parts of the genioglossus (GG) muscle of the tongue (To) 
are clearly discernible. The geniohyoid muscle (GH) is inserted from the symphysis 
(Sy) to the hyoid bone. (N.B. the left panel of Figure 6 is enlarged from this figure.) 


2.2 Tongue musculature and consequences for vocalization 


The tongue musculature in baboons was examined by a dissection pro- 
tocol similar to that used in humans. It appears that tongue musculature 
is structurally similar in humans and baboons, with the styloglossus and 
the three parts of the genioglossus, although the external shapes differ: the 
baboon tongue is flatter while the human tongue is rounded. The muscular 
hydrostat theory of the tongue shape suggests that, as in chimpanzees, 
the primary actions available to the baboon tongue are protrusion and 
retraction (Takemoto, 2008). In addition, the extrinsic muscles raise the 
back of the tongue through the action of styloglossus, while jaw opening 
lowers the back of the tongue along with the mandible. This confers to 
the baboon tongue the necessary degrees of freedom of movement re- 
quired for swallowing (Crompton and German, 1984; Green and Wang, 
2003; Hiiemae, 1967, 2000; Hiiemae and Crompton, 1985; Hiiemae and 
Palmer, 1999; Hiiemae et al., 1995; Martin 1991; Serrurier et al., 2012), 
which can then be used to articulate distinctive vocalizations combining 
two axes (Figure 8). Taking into account the length and flat configuration 
of the hard palate, it is not clear whether the baboon is capable of sounds 
such as /i/, which in humans require a long apical constriction along the 
alveopalatal area. 

Overall, these considerations are nonetheless compatible with exapta- 
tion. Boé et al. (2017) discussed how “[t]he baboon’s muscle fiber orienta- 
tion allows tongue motion along two main axes.” Antagonistic activation 
of the GGa, GGm, and SG tongue muscles produces changes in the vocal tract 
allowing both a front/back contrast homologous to the human [æ] ↔ 
[u] along the first axis and the posterior constriction needed for [u]. The 
GGp and HG tongue muscles produce homologs to the human [a] ↔ [i] 
contrast through vertical tongue displacement along the second axis. These 
two axes do have different orientations in baboons and humans, due to the 
differences in the inclination of the styloglossus and the shape of the tongue 
(Badin and Serrurier, 2006; Buchaillard et al., 2010; Denny and McGowan, 
2012; Honda, 1996). 



The laryngeal descent hypothesis asserted that human-like speech pro- 
duction was not possible because of the lack of a posterior cavity allowing 
a second independent axis: 


This lower, pharyngeal portion of the vocal tract provides a whole new dimension: 
by moving the tongue backwards and forwards this lower tube can be indepen- 
dently modified (this arrangement is thus dubbed a “two-tube” tract) ... Thus, the 
descent of the tongue root and larynx provides an additional degree of freedom, 
a new dimension of control, compared to the capabilities inferred for a normal 
mammalian tract. (Fitch, 2010, p. 312, italics in the original). 


We disagree, and have concluded that the second axis is not related to la- 
ryngeal descent or to the increase in the posterior cavity, but to the tongue 
muscle structure itself. 


Figure 8: Human (left) and baboon (right) muscles and axes producing vowel 
contrasts. Note that the main axis of the styloglossus muscle (SG) is 
nearly horizontal in the baboon, with an action resulting in tongue 
retraction rather than elevation. 



3. Quantitative comparative anatomy of baboons and 
humans 


We now proceed to quantitative analysis of anatomical landmarks associated 
with the hyoid bone, glottis, prosthion, palatal plane, oral and pharyngeal 
cavities, and the C2, C3, and C4 vertebrae. We seek an 
appropriate anatomical description by: (1) measuring the length of the vocal 
tract; (2) decomposing it into an oral part (oral cavity length, OCL) and a 
pharyngeal part (larynx height, LH); (3) locating the glottis and the hyoid 
relative to the cervical vertebrae; and (4) recapitulating these observations 
by age and sex for comparison of baboons with human females, males, 
and children. 

Vocal tract length, measured from the glottis to the lips, depends on the 
bony structure, soft parts, and their interactions during growth (Scammon, 
1930). The vocal tract is composed of an oral part, from the lips to the pha- 
ryngeal point (defined below), and a pharyngeal part, from the pharyngeal 
point to the glottis. In humans, the growth of these parts is heterochronic 
and there is sexual dimorphism: during puberty, the pharyngeal part devel- 
ops more than the oral part, especially in males (Goldstein, 1980). 

Importantly, in modeling, vocal tract length happens to be a key param- 
eter for the estimation of potential capacities for production of formant 
resonances in humans through constrictions and cavities (Boé et al., 1989; 
Bonder, 1983; Liljencrants and Lindblom, 1972) and also for characterizing 
the acoustic structure of vocalizations in nonhuman primates (Boé et al., 
2017; Riede et al., 2005). 


3.1 Vocal tract biometry of baboons from 3D MRI scans 


Fifty-six 3D MRI scans of Papio anubis baboons, consisting of 23 males 
and 33 females from 2 years to adulthood (at least 6 years), were performed 
at La Timone Hospital in Marseille. For the MRI scans the baboons were 
sedated. From the MRI files, a multiplanar analysis with OsiriX® Lite 
software (a DICOM viewer) served to locate the median sagittal plane. To 
do so, we first located median structures in an axial plane: we display in 
Figure 9, in front, the superior mental spine (EM; note that spine is épine 
in French), and in back, the middle of the vertebrae on the axial section. 
The median sagittal section passes through the mandibular symphysis at 
the level of the insertion of the genioglossus on the superior mental spine. 


Figure 9: Reference sections of the baboon head. The median sagittal plane must 
be determined using three points across which there is general bilateral 
symmetry. The plane thus determined invariably sections the hyoid. The 
final settings and sections are shown here. 


In the median sagittal plane, we located the following landmarks, using 
a program written in Matlab® (see Figure 10): the apex of the odontoid 
(Od), the posteroinferior edge of the vertebral bodies of C2, C3, and C4, the 
anterior nasal spine (ENa), the posterior nasal spine (ENp), the prosthion 
(Pr), the upper mental spine (EM), and the top of the upper edge of the 
hyoid bone (Hy). The following other references are obtained by construc- 
tion (semi-landmarks): the occlusal plane (which is parallel to the palatal 
plane ENa-ENp) is evaluated and the pharyngeal point (PPh) is defined as 
the intersection of the occlusal plane with the pharyngeal wall in the sagit- 
tal plane. (Note that the occlusal and palatal planes are presumed perpen- 
dicular to the sagittal plane.) The position of the larynx in relation to the 
cervical spine changes with flexion and extension of the neck (Reidenberg 
and Laitman, 1991; Westhorpe, 1987). 


Figure 10: Hyoid positioning under head flexion. As explained in the text, a circle 
was struck at the hyoid position, Hy, with Od at the center. A correction 
based on the angle of Od-C4 and the palatal plane was then applied to 
correct for head inclination and find the corrected hyoid position, Hyc. 
The projected hyoid position, Hyp, was at the orthogonal projection 
of Hyc on the Od-C4 line. For adult humans Hyp is consistently found 
at the level of C4, regardless of the angle of head inclination. In the 
right-hand panel, we note that the angle is close to 90 degrees, so only 
a minimal correction is necessary from Hy to Hyc. 


In previous primate studies, the inclination of the head relative to the 
cervical vertebrae has not always been well controlled, which has allowed 
vertical displacement of the hyoid bone according to the inclination of 
the head. To correct such differences, we adopted a new procedure and 
validated it first for humans. Working from a profile view, it consisted in 
locating the palatal plane (using the anterior and posterior nasal spines ENa 
and ENp) and then in drawing a circle centered at the apex of the odon- 
toid (the estimated center of rotation) and passing through the top edge of 
the body of the hyoid bone. The angle between the palatal plane and the 
vertebral column (defined as Od-C4), which is compared to a theoretical 
angle of 90° (to make comparisons with humans), is then measured. The 
difference is the correction angle applied to the hyoid bone. After rotation 
(Hy → Hyc), the point Hyc is projected orthogonally onto the 
vertebral column to get the final landmark Hyp (Figure 10). The procedure 
was tested on different radiographs and partly on MRIs of vowels /i a u/ 
pronounced by a French speaker. The position of Hyp along the Od-C4 
axis was found to be constant. We do not apply this procedure for finding the 
position of the glottis, which descends slightly with the inclination of the 
head, but the larynx is generally much more stable than the hyoid bone. 
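
The correction described above can be sketched in a few lines of 2D geometry. The code below is a hypothetical illustration, not the authors' Matlab program: the function name, the coordinate conventions, and the choice to take the signed angle from the Od-C4 axis to the ENp-to-ENa direction are assumptions made here. It rotates Hy about Od so that the palatal plane would sit at 90° to the vertebral axis, and then projects the corrected point Hyc onto the Od-C4 line to obtain Hyp.

```python
import math


def corrected_hyoid_projection(od, c4, ena, enp, hy):
    """Return (Hyc, Hyp, s) for landmarks given as (x, y) pairs in the sagittal plane.

    s is the distance of Hyp from Od along the Od-C4 axis.
    Assumption: the head rotates rigidly about Od, the odontoid apex.
    """
    ax, ay = c4[0] - od[0], c4[1] - od[1]        # vertebral axis, Od -> C4
    px, py = ena[0] - enp[0], ena[1] - enp[1]    # palatal plane, ENp -> ENa

    # Signed angle from the vertebral axis to the palatal plane.
    theta = math.atan2(ax * py - ay * px, ax * px + ay * py)
    # Theoretical angle of 90 degrees, keeping the observed orientation.
    target = math.copysign(math.pi / 2, theta)
    delta = target - theta                       # correction angle

    # Rotate Hy about Od by the correction angle to obtain Hyc.
    hx, hy_rel = hy[0] - od[0], hy[1] - od[1]
    cx = hx * math.cos(delta) - hy_rel * math.sin(delta)
    cy = hx * math.sin(delta) + hy_rel * math.cos(delta)
    hyc = (od[0] + cx, od[1] + cy)

    # Orthogonal projection of Hyc onto the Od-C4 line gives Hyp.
    axis_len = math.hypot(ax, ay)
    s = (cx * ax + cy * ay) / axis_len
    hyp = (od[0] + s * ax / axis_len, od[1] + s * ay / axis_len)
    return hyc, hyp, s
```

Under these assumptions, tilting the palatal plane and the hyoid together about Od leaves s unchanged, which mirrors the constancy of Hyp reported in the text.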

In order to compare vocal tract and other anatomical dimensions across 
species, ages, and subjects, we have normalized all distances based on the 
distance between the apex of the odontoid and C3 inf (dOC3). Thus, we 
analyze the variations in vocal tract dimensions with respect to cervical ver- 
tebrae and their axis taken as a reference. The normalized measures will be 
expressed as ratios without units, and the position of C3 inf is always taken 
as 1. This choice is motivated by the consideration that the vertebral axis 
should be a stable reference in evolutionary terms because of its location 
in the HOX zone (Figure 1), which is known to have undergone few muta- 
tions over a very long time, including the period of mammalian emergence. 
Conversely the oral part of the vocal tract, located in the non-HOX areas, 
is considered variable, as well as its pharyngeal and laryngeal parts which 
are influenced by the hyoid bone position. The normalization operation 
makes it possible to quantify these variations, under the assumption that 
the contribution linked to isometric growth of the vocal tract cavities and 
the cervical vertebrae is thus suppressed, keeping only the relative variations 
with respect to the vertebral axis. The size differences between subjects are 
of course eliminated at the same time. This differentiates our study from 
the Japanese studies of the chimpanzee and macaque (Nishimura, 2003a, 2003b, 
2005, 2006; Nishimura et al., 2003, 2008). This approach allows us to 
evaluate laryngeal descent relative to the odontoid along the vertebral axis, 
and also indirectly relative to the palate. 
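
As a minimal sketch of this normalization, the following code expresses a landmark position along the vertebral axis in units of dOC3, so that the odontoid is at 0 and C3 inf at 1. The function names are hypothetical, and using the Od to C3 inf line itself as the projection axis is a simplifying assumption.

```python
import math

def normalized_axis_position(point, od, c3_inf):
    """Position of `point` projected onto the Od -> C3 inf axis, in units of dOC3.

    Returns 0.0 for the odontoid apex and 1.0 for C3 inf.
    """
    ax, ay = c3_inf[0] - od[0], c3_inf[1] - od[1]
    d_oc3_squared = ax * ax + ay * ay
    return ((point[0] - od[0]) * ax + (point[1] - od[1]) * ay) / d_oc3_squared

def normalized_distance(p, q, od, c3_inf):
    """Distance between p and q as a unitless ratio to dOC3."""
    return math.dist(p, q) / math.dist(od, c3_inf)
```

Because both functions divide by dOC3, any uniform scaling of the image (isometric size difference between subjects) cancels out of the result.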

Figure 11: Landmarks used to characterize baboon vocal tract geometry (15-year-old 
adult female): Prosthion Pr; Pharyngeal Point PPh (located at the 
intersection of the occlusal plane and the pharyngeal wall); Glottis Gl; 
Hyoid Hy, corrected Hyoid Hyc, and projected Hyoid Hyp; Odontoid 
Od; inferior edges of the cervical vertebrae C2, C3, C4; upper mental 
spine EM; and anterior and posterior nasal spines ENa and ENp. 

Figure 12: Biometric results for baboons. (See text for discussion of normalized 
measures.)

After normalization, the distribution of C2 inf position is narrow (Figure 12.1), 
providing a precise landmark. The distance between C2 inf and 
C3 inf (mean = 0.39) is useful for defining a metric expressed in fractions 
of a vertebra, knowing that one vertebra thus defined includes both the 
body and the intervertebral space. The distribution of (normalized) ver- 
tebral projections of the hyoid is bimodal with the main peak around C2 
inf and a second smaller peak somewhat higher on C2 (Figure 12.2). The 
distribution according to age shows a clear increase which is well fitted 
with a sigmoid function (Figure 12.5) and the smaller peak apparently 
represents the young baboons. This is well shown thanks to Gaussian 
modeling (Figure 12.4) applied after decomposition in two age groups 
(<= 6 years and > 6 years): the main peak corresponds to adults (mean 
= 0.64) and the smaller to young baboons (mean = 0.44), the narrow 
Gaussian indicating the position of C2 inf at 0.61. Using these means for 
the 2 age groups and our estimate above of a standard vertebra, we can 
estimate the amount of hyoid descent as (0.64-0.44)/0.39=0.51 vertebra. 
The glottis histogram is apparently monomodal, but we see some varia- 
tion by age that we fit with a partial sigmoid function (Figure 12.6). This 
is also decomposed in two groups with Gaussian modeling (Figure 12.4) 
having means at 0.97 (<= 6 years) and 1.1 (> 6 years). This is around C3 
inf, and the distance between the hyoid bone projection and the glottis 
projection is approximately equal to one vertebra for both groups. The 
glottis descent, (1.1-0.97)/0.39=0.33 vertebra, is less than half a vertebra. 
Note that our determination of the location of the hyoid bone position 
is more precise than for the glottis, which is, ultimately, an empty space. 
Moreover, we do not apply any correction to the glottis position. Thus, 
the hypothesis of having approximately the same descent for hyoid bone 
and glottis, about half a vertebra, is reasonable. In summary, we quantify 
the process of laryngeal descent from baboon childhood to adulthood in 
the following manner: from the middle of C2 to C2 inf for the hyoid, and 
from the middle of C3 to C3 inf for the glottis, with a constant distance 
of one vertebra between hyoid and glottis. We have also found that dOC3 
does not vary much with age (data not shown). 
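
The conversion from normalized positions to fractions of a vertebra can be checked with a short calculation. The sketch below simply restates the arithmetic of the paragraph above; the variable and function names are illustrative.

```python
C2_INF = 0.61   # normalized position of C2 inf (narrow Gaussian in Figure 12.4)
C3_INF = 1.0    # by definition of the normalization
VERTEBRA = C3_INF - C2_INF  # one vertebra (body + intervertebral space) = 0.39

def descent_in_vertebrae(young_mean, adult_mean, vertebra=VERTEBRA):
    """Descent between two normalized mean positions, in vertebra units."""
    return (adult_mean - young_mean) / vertebra

hyoid_descent = descent_in_vertebrae(0.44, 0.64)    # about 0.51 vertebra
glottis_descent = descent_in_vertebrae(0.97, 1.10)  # about 0.33 vertebra
print(round(hyoid_descent, 2), round(glottis_descent, 2))
```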

The oral cavity length (OCL) is normalized similarly in order to see 
if developmental laryngeal descent in baboons is associated with an in- 
creased prognathism. The histogram of OCL is monomodal with a peak 
at about 3 (Figure 12.7). In other words, the distance between the prosthion 
and the pharyngeal point (resp. Pr and PPh in Figure 11) is on 
average about 3 times the distance between the glottis projection (centered 
on C3 inf) and the odontoid. The distribution of OCL according to 
age is divided into two groups (young and adults, with the limit at 6 years) to 
show a small increase in prognathism (Figure 12.8). Complementarily, 
the larynx height index (LHI) (Honda and Tiede, 1998) is defined as the 
ratio of laryngeal height to OCL, laryngeal height being the pharyngeal-glottal 
distance (Figure 11). It averages a constant 0.43 across age in 
baboons (Figure 12.9), because it is a ratio between two similarly increas- 
ing values. In contrast, LHI varies from 0.5 at birth to 1.0 at adulthood 
for humans, since there is no increase in OCL while over time there is 
laryngeal descent. 
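
The behavior of the index can be illustrated with a small sketch. It assumes that LHI is laryngeal height divided by OCL, which is the reading consistent with the values quoted above (0.43 for baboons, rising from 0.5 to 1.0 in humans); the input values below are hypothetical and chosen only to reproduce those ratios.

```python
def larynx_height_index(laryngeal_height, oral_cavity_length):
    """LHI as the ratio of laryngeal height to oral cavity length (assumed form)."""
    if oral_cavity_length <= 0:
        raise ValueError("oral cavity length must be positive")
    return laryngeal_height / oral_cavity_length

# Baboon-like case: both dimensions grow by the same factor, so LHI is unchanged.
young = larynx_height_index(1.29, 3.0)
adult = larynx_height_index(1.29 * 1.2, 3.0 * 1.2)

# Human-like case: OCL stays fixed while laryngeal height increases.
at_birth = larynx_height_index(0.925, 1.85)
at_adulthood = larynx_height_index(1.85, 1.85)
```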


3.2 Vocal tract biometry of human children, from radiography 


The position of the hyoid bone and larynx in children has been reported 
in various studies (Amayeri et al., 2014; Coelho-Ferraz et al., 2006, 2007; 
Grant, 1965; Westhorpe, 1987). Our own radiographs of children were 
obtained as part of the SkullSpeech project (Perrier and Boé, 2009-2012). 
We also obtained radiographic data from M. J. Deshayes (127 children, girls 
and boys from 3 to 7.5 years, mean 5.2 years, standard deviation 0.95 year), 
an age range in which there is no sexual dimorphism. 

We analyzed the radiographs with the same general procedure as for ba- 
boons, adding C4 inf, but eliminating the glottis, which was not visible. We 
applied the same normalization with dOC3. It appears that the distributions 
of both C2 inf and C4 inf are narrow. The hyoid projection is between the 
two, near C3 inf. The mean OCL value is 1.85, and the ratio of OCL be- 
tween young baboons and infants is 3/1.85=1.62 indicating the high degree 
of prognathism in young baboons. The variations of both hyoid position 
and OCL seem minimal around 5 years of age. LHI cannot be measured 
since it requires glottis position, and the glottis was not visible. 

Figure 13: Biometric results for human children. (See text for discussion of 
normalized measures.) (1) Positions of C2 inf and C4 inf, while C3 inf 
is at 1 by definition; (2) Hyoid bone vertebral projection, Hyp, mean 
= 1.08 centered on the space below C3 inf; (3) Gaussian models of 
distributions of the positions of C2 inf and C4 inf and of the hyoid 
projection, Hyp; (4) Oral Cavity Length (OCL), mean=1.85; (5) Hyoid 
projection, Hyp, across age, with no significant change; (6) OCL across 
age, with no significant change. 

Figure 14: Position of hyoid and glottis compared to cervical vertebrae for young 
(< 6 years) and adult baboons, and in humans, for newborns (Westhorpe, 
1987; Barbier, 2010), 5-year-olds (present data; Barbier, 2010), and adult 
females (Barbier, 2010) and males (Westhorpe, 1987; Barbier, 2010). 

Though we detail here only data for human children, our previous studies 
address growth patterns of the glottis, hyoid, and vertebrae in humans from 
childhood through adulthood (Barbier, 2010; Barbier et al., 2012, 2015). 
These studies show that in both males and females, there is considerable 
vertical growth during the first two years of life, then the growth of C3 
and the hyoid position stabilize between 3 and 8 years of age. Glottis descent 
appears highly correlated with that of C3 and the hyoid. At around 
10 years there begins a second surge of vertical growth, which only affects 
the hyoid and the glottis, C3 being extremely stable after the 8th year. This 
second surge, greater for the glottis than for the hyoid, seems to stabilize 
for the hyoid at about 15 years for women, but continues for the glottis 
until nearly 20 years for men (Barbier, 2010; Barbier et al., 2012, 2015). 
Figure 14 summarizes all these data in five diagrams comparing young 
baboons, adult baboons, and human children and adults, female and male. 
In all these diagrams we observe an offset of about one vertebra between 
the vertebral projections of the hyoid and the glottis. 

We showed in the previous section that the hyoid in the young baboon 
(less than 6 years old) projects to the level of the body of C2, and for 
the adult baboon to the level of C2 inf. Our data show that lowering 
in baboons takes place in a single step, without any clear sexual dimorphism. 
For comparison, the hyoid bone projects to the same level in the 
adult baboon as in the human newborn (Barbier et al., 2012, 2015). 
However, the descent of the hyoid in humans takes place in two stages, 
and we found that in the 5-year-old child, the hyoid bone was around 
C3 inf (Figure 13). After adolescence, we estimate that the descent takes 
place down to C4 in females and C4 inf in males. It also appears that 
in baboons (Figure 12) as well as 5-year-old humans (Figure 13), and 
indeed in all cases (Figure 14), the glottis is situated about 1 vertebra 
below the hyoid bone. 

It can be suggested that there might be an underlying morphological 
invariant, namely the hyoid-glottis distance, allowing the epiglottis to play 
its role of protecting the airways while maintaining a constant relationship 
between its top and the base of the tongue. This hypothesis is reinforced by 
the fact that these anatomical elements are all regulated by HOX genes. In 
contrast, the oral anatomy would have escaped from HOX control (Chai 
and Maxson, 2006), leading to greater changes including a decrease in 
prognathism associated with a caudal displacement of the tongue and an 
increase in verticality. The consequence would be a caudal hyoid-glottis 
translation relative to the spinal column, maintaining the distance between 
them to enable swallowing and to protect the airways. 


3.3 Vocal tract growth in humans and baboons 


In a previous study (Barbier et al., 2012, 2015) a grouping of four American 
Association of Orthodontists (AAO) archives was used to quantify (human) 
vocal tract growth. These records contain 966 sagittal X-rays of the head 
and neck for 68 white North American subjects (33 women and 35 men), 
obtained approximately every year between 1 month and 25 years in order 
to study longitudinal growth of the dentition. For baboons, we were able 
to retain only 25 of our 56 subjects, the lips being sometimes absent from 
the MRI. 

Figure 15: Vocal tract length (VTL) measurement from glottis to lips: 10 landmarks 
positioned by hand and then joined with a spline curve. Baboon Papio 
anubis, male, 15 years, VTL = 13.2 cm. 


Figure 15 shows the method of VTL measurement used for the 25 baboons. 
The distance from lips to glottis is the length of a spline curve defined with 
ten points manually placed on the tongue surface. Figure 16 shows VTL 
growth obtained by fitting data with a double sigmoid function for hu- 
man females and males (Barbier et al., 2012, 2015) and, for comparison, 
measurements modeled by a simple sigmoid for female and male baboons. 
The choice between simple and double sigmoids stems from the fact that 
baboons reach their maturity around 6-7 years without an adolescent 
phase. For adult baboons, as for humans, there is sexual dimorphism, so 
the growth profiles are not homogeneous and diverge after a few years (Figure 16). 
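
The VTL measurement (length of a spline through hand-placed landmarks) can be sketched as follows. The text does not specify which spline was used, so the uniform Catmull-Rom interpolation below is a stand-in chosen for illustration, and the function names and sampling density are assumptions.

```python
import math

def catmull_rom_point(p0, p1, p2, p3, t):
    """Point on a uniform Catmull-Rom segment between p1 and p2 (0 <= t <= 1)."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b + (c - a) * t + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def spline_length(landmarks, samples_per_segment=50):
    """Approximate length of a Catmull-Rom curve through the given landmarks.

    The curve passes through every landmark; its length is estimated by
    summing short straight steps along each segment.
    """
    # Duplicate the end points so the first and last segments are defined.
    pts = [landmarks[0]] + list(landmarks) + [landmarks[-1]]
    length = 0.0
    previous = landmarks[0]
    for i in range(1, len(pts) - 2):
        for k in range(1, samples_per_segment + 1):
            current = catmull_rom_point(pts[i - 1], pts[i], pts[i + 1], pts[i + 2],
                                        k / samples_per_segment)
            length += math.dist(previous, current)
            previous = current
    return length
```

With ten landmarks ordered from glottis to lips and coordinates in centimeters, `spline_length(landmarks)` returns an estimate of VTL in centimeters.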


126 F. Berthommier, L.J Boé, A. Meguerditchian, T. Sawallis, G. Captier 


Figure 16: Development of vocal tract length from fertilization (assuming 9 
months gestation) to adulthood for female and male humans (Barbier 
et al., 2012, 2015), as fitted with a double sigmoid, and for 12 female 
and 13 male baboons (assuming 6 months of gestation), as modeled 
with a simple sigmoid. Axes: VTL (cm) against age (years). 


Length analysis of the vocal tract’s oral segment as normalized to dOC3 
shows that there is additional growth linked to prognathism in baboons 
from youth into adulthood (Figure 12.8, Oral Cavity Length). The fact that 
the vocal tract grows in both its OCL and LH dimensions results in stability 
of the Larynx Height Index (Figure 12.9). In addition, we measured VTL 
(Figure 15) using points manually placed on the back of the tongue and 
including the two oral and pharyngeal segments, which showed a one-step 
variation during baboon growth (Figure 16). In contrast, human data show 
a two-step VTL growth that is consistent with laryngeal descent, which 
takes place in two stages as well. We conclude that vocal tract growth in 
the baboon is less than in humans in its pharyngeal region, but much greater 
than in humans in the oral region (Goldstein, 1980). This results in a large 
difference in LHI between baboons and humans: 0.43 for baboons vs. 0.75 
for female and 1.0 for male humans. Remarkably, laryngeal descent in 
humans compensates for the lack of prognathism, with the ultimate effect 
of preserving vocal tract length. 
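
The contrast between one-step and two-step growth can be made concrete with generic growth curves. The exact parameterization used for the fits is not given in the text, so the logistic forms below are an assumption, and all parameter values in the usage note are hypothetical.

```python
import math

def sigmoid(age, lower, upper, midpoint, rate):
    """Simple logistic growth from `lower` to `upper`, centered at `midpoint`."""
    return lower + (upper - lower) / (1.0 + math.exp(-rate * (age - midpoint)))

def double_sigmoid(age, lower, step1, mid1, rate1, step2, mid2, rate2):
    """Sum of two logistic steps: early growth plus a later (adolescent) surge."""
    first = step1 / (1.0 + math.exp(-rate1 * (age - mid1)))
    second = step2 / (1.0 + math.exp(-rate2 * (age - mid2)))
    return lower + first + second
```

For example, `double_sigmoid(7.0, 6.0, 5.0, 1.0, 2.0, 4.0, 13.0, 1.5)` sits on the plateau between the two steps (about 11), whereas a simple sigmoid has no such intermediate plateau.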


4. Conclusion 


In this study, we propose a series of new qualitative and quantitative bio- 
metrical analyses of oropharyngeal anatomy adapted and normalized for 
the comparison between baboons and humans. Using the vertebral column 
as a fixed phylogenetic reference, we derive a measure of larynx height (the 
LHI), and also of variations of the oral cavity length expressed with a new 
metric. We show that the distance of one vertebra separating the hyoid 
bone and the glottis appears to be invariant, despite the great morphologi- 
cal differences illustrated in Figure 4. A new representation of the laryngeal 
descent process is summarized in Figure 14. In comparison to the young 
baboon, we find that the adult baboon has a single-stage laryngeal descent 
of only ½ vertebra. The oral cavity then grows an equivalent length, thus 
keeping LHI constant. The human newborn is at the level of the adult baboon, 
and humans undergo two descent stages, cumulatively amounting to 
1½ vertebrae for females and 2 for males. Since this is realized without the 
increased oral length from prognathism, it results in a large LHI increase. 

Functionally, the distance of 1 vertebra between the hyoid bone and the 
glottis is highly constrained by the mechanical requirements of swallowing. 
As baboon tongue musculature is similar to that of humans, if we consider 
the hyoid bone as the tongue root, laryngeal descent in humans corresponds 
to a tongue shift toward the back, without modification of the relationship 
between the tongue and the oropharyngeal cavity because the length of the 
oral cavity is preserved. We assume this is also constrained by the swallow- 
ing function, because the role of the tongue is to drive the food, solid or 
liquid, from the lips to the esophagus. In baboons, the tongue tip appears to 
receive no fibers from GGa. This suggests that in humans, laryngeal descent 
divided the vocal tract into two cavities, with a specialization of the anterior 
part in chewing and preparation of the food for ingestion. The corollary 
of this specialization was to acquire a better musculature of the tongue tip. 

All these observations are compatible with the hypothesis of speech 
exaptation from feeding gestures. To begin with, tongue musculature in 
baboons and humans is similar, with the same two control axes enabling 
constrictions and cavities to form. Swallowing actually involves vocal tract 
constrictions, and even if the tongue is flat in baboons, the relationship be- 
tween tongue and oropharyngeal cavity remains similar to that of humans. 

Fitch (2010) recapitulates the argument that laryngeal descent is required to 
form the posterior cavity to produce the vowel /u/, and more generally, the 
human vowel triangle. While baboons probably cannot produce a vocalization 
close to /i/ because their tongue tip musculature is not developed, a high 
larynx is not an obstacle preventing vocalizations with formants diverg- 
ing significantly from the central vowel. More precisely, we consider that 
laryngeal descent is not required for the development of a posterior cavity. 
In fact, the stability of vocal tract length and the constant relationship of 
tongue to oropharynx ensure that the production of a /u/-like vocalization 
is possible for baboons through contraction of the styloglossus, the same 
muscle as in humans, allowing a retraction of the tongue and forming a 
posterior cavity (Figure 8; also see Boé et al., 2017). Constriction control, 
already present to allow swallowing, is the crucial factor in production of 
distinct vowel qualities, and this is clearly an example of exaptation from 
feeding gestures for speech. 


Evolution of the Laryngeal Motor Cortex 
for Speech Production 


Abstract: Considerable progress has been recently made in understanding the brain 
mechanisms underlying speech and language control. One of the important but 
oftentimes overlooked aspects of speech production is the ability of the primary mo- 
tor cortex to control fine movements of the laryngeal muscles for the production of 
learned vocalizations. In this respect, the laryngeal motor cortex is indispensable not 
only for the development of novel articulatory sequences but also for coordination 
of sensorimotor interactions for smooth speech motor output. In this chapter, we 
discuss the comparative organization and function of the laryngeal motor cortex in 
humans and non-human primates and provide some insights into the evolutionary 
importance of this cortical region in shaping human speaking. 


Keywords: laryngeal control, non-human primates, laryngeal motor cortex 


1. Introduction 


One of the most puzzling questions in evolutionary biology is that of the 
unique human capacity for speech and song. Singers in particular exhibit 
a remarkable control of laryngeal muscles through prolonged expiration, 
phonation, and pitch. During phonation for speech and song, voice onset 
is precisely timed, which allows linguistic distinctions between voiced and 
voiceless consonants. Changes in the subglottal pressure due to changes in 
lung volume, the elastic properties of the chest wall, and the active contrac- 
tion of the intercostal and abdominal muscles lead to modulations of voice 
intensity, whereas the resonance characteristics of the supraglottal region 
(e.g., oral and pharyngeal cavities) influence the spectral properties of the 
sound. Vocal fold movements are controlled by intrinsic and extrinsic laryn- 
geal muscles. The intrinsic laryngeal muscles are confined to the larynx and 
participate in vocal fold closure (thyroarytenoid, TA; lateral cricoarytenoid, 
LCA; and interarytenoid muscles, IA), opening (posterior cricoarytenoid 
muscle, PCA), and lengthening (cricothyroid muscle, CT) (Fig. 1C,D). The 
extrinsic muscles connect the larynx with surrounding structures, such as 
the hyoid bone, sternum and pharynx, and raise or lower the larynx within 
the neck relative to the spine to modulate vocal fold length, fundamental 
frequency, oro-pharyngeal resonance frequencies and formant structure. 
In humans, fine movements of the laryngeal muscles are under control 
of the laryngeal motor cortex (LMC). Bilateral lesions of human LMC 
abolish speech and song production but preserve innate vocalizations such 
as laughing and crying (Amassian et al., 1933; Groswasser et al., 1988; 
Jurgens et al., 1982; Mao et al., 1989), which are types of non-verbal vo- 
calizations controlled by subcortical structures and present even in infants 
(Jurgens, 2002; Simonyan and Horwitz, 2011). On the other hand, nonhu- 
man primates, such as the rhesus monkey and squirrel monkey, produce 
innate vocalizations and have a very limited, if any, ability to learn new 
vocalizations. Laryngeal muscles in nonhuman primates, too, are controlled 
by the LMC, albeit to a lesser extent. However, bilateral lesions of the LMC 
in monkeys have no effect on their vocalizations: LMC destruction 
does not affect their ability to produce species-specific calls (Jurgens et 
al., 1982; Kirzinger and Jurgens, 1982; Sutton et al., 1974). As we discuss 
below, these behavioral discrepancies may be explained by the location and 
structural organization of the LMC in humans vs. nonhuman primates, 
potentially contributing to the evolution of our ability to speak. 


2. Specification of LMC localization and organization 


One of the factors contributing to LMC developmental maturation in humans 
lies in its localization. Electrical stimulation in rhesus macaque and 
squirrel monkeys has long localized an isolated region of laryngeal muscle 
representation within the ventral premotor cortex (Brodmann area 6) be- 
tween the inferior arcuate sulcus and subcentral dimple (Hast et al., 1974; 
Jurgens, 1982; Simonyan and Jurgens, 2002, 2003, 2005a, b) (see Figure 1). 
Stimulation of this region produces isolated and symmetrical adduction of 
the vocal folds but no vocalization. More posteriorly in the primary mo- 
tor cortex (Brodmann area 4), the laryngeal movements are elicited only 
in combination with orofacial movements, suggesting that an isolated and 
segregated representation of the LMC in nonhuman primates is limited to 
the premotor cortex. 

In humans, Penfield and colleagues have mapped the “vocalization” area 
within the primary motor cortex, stimulation of which elicited speech-like 
sounds (Penfield and Boldrey, 1937) (Figure 2). However, since these first 
studies in the 1930s, the exact localization of the laryngeal muscles within this 
“vocalization” region (which is a behavior and not a body part representation) 
remained largely unknown. The first systematic exploration of the primary 
motor cortex for the localization of the LMC in humans was performed by 
Rodel and colleagues in 2004, who used transcranial magnetic stimulation 
to selectively stimulate the intrinsic laryngeal muscles (cricothyroid and 
thyroarytenoid) (Rodel et al., 2004). 

Figure 1: (A) Schematic drawing of the sites of vocalization elicitation during 
direct electrical stimulation of the primary motor cortex in humans 
(Penfield & Rasmussen, 1950) and the sagittal section of the brainstem 
depicting the distribution of degenerating fibers (small dots) in the 
nucleus ambiguus (Amb) and surrounding reticular formation (Kuypers, 
1958). The arrows represent the direct (monosynaptic) connections 
from the LMC to the reticular formation and nucleus ambiguus, the site 
of laryngeal motoneurons, which project to the laryngeal muscles. (B) 
Schematic drawing of topographic representation of the intrinsic and 
extrinsic laryngeal muscle in the premotor cortex (Hast et al., 1974). Sca 
— subcentral dimple; right-angled triangle — cricothyroid muscle; circle 
— thyroarytenoid muscle; encircled right-angled triangle — combination 
of the cricothyroid and thyroarytenoid muscles; square — extrinsic 
laryngeal muscles. (bottom) The cross-section of the brainstem and 
photomicrographs show terminal fields of the laryngeal motor cortical 
projections in the reticular formation (RF) but not nucleus ambiguus in 
the rhesus monkey, which was injected with the anterograde tracer, biotin 
dextranamine, into the LMC (Simonyan and Jurgens, 2003). The arrows 
show indirect connection of the LMC with the nucleus ambiguus via the 
surrounding reticular formation. The scale bar corresponds to 50 mm. 
Adapted from Simonyan (2014). 

This study has not only provided the first description of the isolated laryngeal 
muscle representation in the motor cortex but also localized it to Brodmann 
area 4 in humans as opposed to Brodmann area 6 in nonhuman 
primates. More recent functional MRI and electrocorticography (ECoG) 
studies have confirmed this initial finding and provided a more detailed 
characterization of the LMC somatotopy within the speech motor cortex 
and connectivity. Specifically, these studies localized the LMC to a subdivision 
of the primary motor cortex, namely area 4p (Bouchard et al., 2013; 
Kumar et al., 2016; Simonyan, 2014), and identified another segregated 
LMC location in area 6 of the premotor cortex (Simonyan, 2014), 
which is similar to the LMC location in nonhuman primates (Simonyan 
and Jurgens, 2002) (see Figure 2). 


Figure 2: (A) The ‘Motor sequence’ within the primary motor cortex with the 
extensive vocalization region in the inferior portion of the precentral 
gyrus (Penfield and Jasper, 1954). (B) Meta-analysis of 19 fMRI 
studies in humans between 2000 and 2013 using activation likelihood 
estimation (ALE) of brain function during voice production. Bilateral 
peaks of LMC activation are found in the area 4p with an additional 
peak of activation in the left area 6 (Simonyan, 2014). Adapted from 
Simonyan (2014). 

Nevertheless, the LMC in area 4p in humans and area 6 in nonhuman primates 
are presumed to be functional homologues because stimulation of 
both regions evokes isolated and segregated intrinsic laryngeal movements 
(Kumar et al., 2016). It is plausible that the dual representation of this 
region in both primary motor and premotor cortices in humans as well as 
the shift of its functionality in the control of the laryngeal muscles to the 
primary motor cortical location in humans may have played a role in its ability 
to coordinate more complex vocal tasks, such as production of speech 
and song. As this premotor position is present in both rhesus and squirrel 
monkeys, separated by 35 million years, it is possible that this shift to and 
involvement of the primary motor cortex evolved de novo in humans (Simonyan, 
2014). The position of the LMC within the primary motor cortex has 
presumably allowed for enhanced coupling of expiration, phonation, 
and articulation, which are the three essential aspects of voice production 
for speech and song, within the same region. Moreover, this ‘new’ LMC in 
the primary motor cortex may have played a crucial role in the evolution 
of human speech, through expanded structural and functional connectivity 
with other brain regions involved in vocalization and improved regulation 
of speech-motor planning and feedback. The secondary ‘old’ localization 
of LMC in area 6, which is present in both humans and nonhuman pri- 
mates, potentially controls more universal laryngeal functions present in 
both species, such as participation in breathing and other related functions 
associated with straining of the laryngeal muscles, as during jumping or 
lifting heavy weights. 


3. Connectivity of the laryngeal motor cortex 


Another important factor contributing to the maturation and evolution 
of the LMC is the establishment, in humans, of direct monosynaptic connections with 
laryngeal motoneurons located in the nucleus ambiguus of the brainstem. 
These neurons control intrinsic laryngeal muscles, which are responsible for 
a variety of laryngeal behaviors, including voice production. In contrast, 
the corticobulbar projections from the LMC in the monkey are indirect and 
first synapse in the surrounding reticular formation and brainstem phona- 
tory sensory nuclei (Iwatsubo et al., 1990; Jurgens, 1976; Kuypers, 1958; 
Simonyan and Jurgens, 2003). Because this control of brainstem laryngeal 
motoneurons is direct in humans but indirect in monkeys, the ability of the LMC 
to finely modulate laryngeal movements for learned vocalization is much 
more limited in monkeys than in humans (see Figure 1). 

Despite the lack of direct LMC-ambiguus connection, the neuroanatomi- 
cal tract tracing studies in nonhuman primates, such as rhesus monkey and 
squirrel monkey, have established that the LMC has extensive connections 
to a number of cortical and subcortical regions (Jurgens, 1976; Simonyan 
and Jurgens, 2002, 2003, 2005a, b). Diffusion MRI studies in humans have 
revealed similar structural connectivity (Simonyan et al., 2009; Kumar et 
al., 2016). LMC in both species is highly connected to sensorimotor re- 
gions (primary somatosensory area, supplementary motor area), auditory 
processing and sensorimotor integration centers (superior temporal cortex, 
inferior parietal cortex, prefrontal gyrus, supramarginal gyrus) as well as 
limbic regions (cingulate gyrus, insula), basal ganglia and thalamus (see 
Figure 3). The overall similarity of LMC networks between humans and 
nonhuman primates might explain why monkeys are able to control the 
timing and duration of calls (Coude et al., 2011), while they are not able 
to learn finer motor control of their laryngeal muscles for more complex 
voice production as during speaking. It also suggests that the evolution 
of human speech relied largely on the maturation and reorganization of 
neural pathways within this vast structural network that is already in place 
in nonhuman primates. 

Figure 3: LMC cortical and subcortical networks. Block diagrams illustrate the 
reciprocal (left box), outgoing (middle box), and incoming (right box) 
connections of the LMC as defined using neuroanatomical tracing 
studies in nonhuman primates and diffusion tensor tractography in 
humans. Asterisk (*) indicates that the projection from the laryngeal 
motor cortex to the nucleus ambiguus exists only in humans but not in 
nonhuman primates. Adapted from Simonyan and Horwitz (2011). 




In addition, several other cortical LMC connections appear to be enhanced 
in humans, potentially allowing for the development of our ability to better 
control sensorimotor integration for speech production. A recent study 
examining LMC connectivity with probabilistic diffusion weighted tractog- 
raphy has revealed a remarkable 7-fold increase in LMC-parietal/temporal 
connectivity strength in humans compared to nonhuman primates (rhesus 
monkey) (Kumar et al., 2016) (see Figure 4). Among the targets of these 
connections is the supramarginal gyrus (SMG) of the inferior parietal lobule 
(IPL). Connections of the IPL with Broca’s area and auditory areas suggest a key role 
of this region in language development, auditory language processing, and 
integration of auditory word forms to generate speech (Caspers et al., 2011; 
Price, 2000). Functional studies also implicate left IPL in the semantic and 
phonological processing network (Vigneau et al., 2006), while its lesions 
are known to produce receptive aphasia (Alexander et al., 1989; Hart and 
Gordon, 1990; Kertesz et al., 1982). 
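
A connectivity fingerprint of the kind shown in Figure 4 can be sketched as a simple proportion calculation. This is a generic illustration, not the pipeline of Kumar et al. (2016): the function names are hypothetical, and the tract counts in the usage note are made-up numbers, not data from the study.

```python
def fingerprint(tract_counts):
    """Percentage of LMC tracts reaching each target region."""
    total = sum(tract_counts.values())
    if total <= 0:
        raise ValueError("no tracts to distribute")
    return {region: 100.0 * count / total for region, count in tract_counts.items()}

def fold_change(fp_a, fp_b, regions):
    """Ratio of the summed percentages over `regions` in fp_a versus fp_b."""
    share_a = sum(fp_a[region] for region in regions)
    share_b = sum(fp_b[region] for region in regions)
    return share_a / share_b
```

For instance, with hypothetical counts `{"S1": 20, "IPL": 50, "STG": 30}` for one species and `{"S1": 80, "IPL": 10, "STG": 10}` for the other, `fold_change` over `("IPL", "STG")` gives the factor by which the parietal/temporal share differs between the two fingerprints.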

Figure 4: Quantitative distribution of LMC connections in human (A) and 
nonhuman primates (B). Connectivity fingerprints show the proportion 
(%, in logarithmic scale) of LMC tracts reaching each target region 
in humans and macaques. The corresponding percentage of each 
tract contribution is given on the right. S1: primary somatosensory 
cortex; IFG: inferior frontal gyrus; STG: superior temporal gyrus; 
IPL: inferior parietal lobule; SMA: supplementary motor area; 
ACC: anterior cingulate cortex; MCC: middle cingulate cortex; Put: 
putamen; Cd: caudate nucleus; Gp: globus pallidus; Thal: thalamus. 


[Figure 4 legend: percentage contribution of each LMC tract to right (R) and left (L) hemisphere target regions, listed separately for humans and macaques.]




Other diffusion tractography studies have shown that a dorsal arcuate
fasciculus projecting to IPL is present in chimpanzees as well as humans
but not in rhesus monkeys (Petrides and Pandya, 2009; Rilling et
al., 2008, 2011, 2012). The arcuate fasciculus, which connects Broca’s
and Wernicke’s areas (Catani et al., 2005), is crucial for word-learning
in humans (Lopez-Barroso et al., 2013). As a relay between frontal and 
temporal regions, IPL maps auditory stimuli into lexical and articulatory- 
gestural representations for speech production (Bohland and Guenther, 
2006; Guenther, 2006; Jardri et al., 2007; McNamara et al., 2008). It is
one of only a few sensorimotor centers that coordinate speech production
as well as comprehension (Fiebach et al., 2007; Hocking et al., 2009;
Simonyan and Horwitz, 2011; Zheng et al., 2010). Taken together, the 
expansion of these LMC-parietal/temporal connections in humans points 
to the importance of somatosensory feedback, such as proprioceptive and 
tactile sensation, in the perception of speech and modulation of motor 
activities leading to the production of a complex learned vocal behavior. 
Whether the new IPL/STG - LMC connections originated de novo in 
humans or rather, like the arcuate fasciculus, in chimpanzees remains to 
be investigated. 

While the LMC-parietal/temporal pathway emerged as the most dominant, comparison
of the broader LMC network in humans and macaques points to the
importance of many other regional hubs in higher-order processing centers.
Subcortical connections to LMC appear to be bilateral and structurally 
similar in both species. Basal ganglia circuitry dysfunction is implicated 
in a variety of motor disorders, such as Parkinson’s disease, tremor, and 
dystonia. Based on neuroanatomical tract tracing studies in macaques and 
probabilistic diffusion tractography in humans, the putamen is the major 
basal ganglia output target of LMC (Kumar et al., 2016; Simonyan and
Jürgens, 2003). Lesions in the putamen result in dysarthria and dysphonia
in humans but do not alter monkey innate vocalization (Jürgens, 2002;
Jürgens and Ploog, 1970), suggesting that it has a greater role in the production
of learned vs. innate vocalizations.

Tracts from LMC to superior temporal gyrus (STG) are more widely 
distributed in humans and left-lateralized when compared to macaques (Ku- 
mar et al., 2016; Simonyan et al., 2009). Bilateral isolated lesions in STG 
cause ‘word deafness,’ or inability to comprehend heard speech (Alexander
et al., 1989; Buchman et al., 1986; Hart and Gordon, 1990; Kertesz et al.,
1982). The diversification of left LMC-STG tracts in humans may reflect the
importance of auditory feedback in rapid modification of speech production
as opposed to innate vocalization. STG is activated not only during speech
perception but also during vocalization. Furthermore, fMRI studies show that LMC is
activated mostly during syllable production, but also in speech perception 
(Wilson et al., 2004). The coupling of these regions points to an important 
role for LMC in auditory processing and integration into motor movements. 


4. Conclusion 


Although investigation into the role of LMC in speech has undergone great 
developments in the past decade, much is yet to be explored. We can assert 
that humans have undergone three important evolutionary modifications 
with respect to the LMC organization. First, the development of direct 
neuronal connections to nucleus ambiguus allowed pathways to bypass 
other relay stations, resulting in direct control of laryngeal motoneurons in 
the brainstem. Second, the shift of functionally active LMC from premotor 
to primary motor cortex facilitated a more precise control of the laryngeal 
muscles for production of complex vocal tasks. Third, this shift allowed for 
expansion of structural and functional LMC cortical connectivity with the 
parieto-temporal regions in addition to subcortical pathways.


Motor and Communicative Correlates of
the Inferior Frontal Gyrus (Broca’s Area) 
in Chimpanzees 


Abstract: The inferior frontal gyrus (IFG) encompasses Broca’s area, a brain region 
implicated in a variety of cognitive and linguistic functions. For instance, clinical 
and experimental data suggest that the left IFG plays an important role in language 
and speech. In this paper, I briefly summarize data on the sulci and morphological 
landmarks that define the IFG in humans, chimpanzees and monkeys. I also present 
some preliminary data on the surface area, mean depth and gray matter thickness of 
the three primary sulci that comprise the IFG in the chimpanzee brain including the 
fronto-orbital, precentral inferior and inferior frontal sulci. I further present data 
on associations between individual variation in asymmetries of each sulcus with 
measures of oro-facial motor control and tool use skill. The implications of these 
findings for different theories on the evolution of language and higher order motor 
and cognitive functions in primates are discussed.


1. Introduction


A portion of the inferior frontal gyrus (IFG) in the human brain consti- 
tutes Broca’s area, named for the famous French physician Paul Broca 
who originally described this brain region’s role in language and speech
production in patients who suffered damage to this area (Broca, 1861).
Specifically, Broca obtained the post-mortem brains of two of his patients
who showed deficits in speech production and found that both individuals
had pronounced damage to the left inferior frontal gyrus (Dronkers et al., 
2007). From these results, Broca described the critically important role of 
the inferior frontal gyrus for the faculty of language in humans, particularly 
in the left hemisphere. More than 150 years later, the conclusions drawn 


154 William D. Hopkins 


by Broca have largely been confirmed in both the clinical and experimental 
literature. For instance, an abundance of data has shown that deficits in 
speech (aphasia) and praxic functions (motor planning) are significantly 
more prevalent when damage occurs in the left compared to right hemi- 
sphere (Goldenberg and Randerath, 2015; Meador et al., 1999). Findings 
from the Wada test and more recently using transcranial doppler sonogra- 
phy have shown convincingly that a majority of individuals, particularly 
right-handed subjects, are left lateralized for language (Knecht et al., 2000; 
Rasmussen and Milner, 1977). More recent functional imaging studies have 
shown that Broca’s area (along with other cortical areas) plays an important
role in a variety of linguistic and other cognitive functions (Cooper, 2006; 
Fazio et al., 2009; Horwitz et al., 1999; Makuuchi, 2005; Nishitani et al., 
2005; Passingham, 1981). Lastly, the evidence that lesions to the left but 
not right hemisphere affected language functions spurred considerable inter- 
est in the concept of hemispheric specialization. Indeed, given the robust 
nature of lateralization for language, it is difficult to discuss the evolution 
of language without consideration of the topic of hemispheric specialization 
(Bradshaw and Rogers, 1993; Corballis, 1992, 2003). 

Given the historic and more recent significance placed on the role of 
Broca’s area in language, higher order cognition and motor control, there 
has naturally been considerable interest in the evolution of this brain re- 
gion, particularly within primates and specifically among great apes (Loh 
et al., 2016). The motivation for studies on the evolution of the IFG has 
no doubt been stimulated by the growing body of evidence of sophisti- 
cated cognitive and some language-like abilities displayed by primates in 
comparison to more distantly related mammalian species (Crockford et al., 
2004; Crockford et al., 2012; Schel et al., 2013; Seyfarth and Cheney, 2010; 
Slocombe and Zuberbühler, 2005; Slocombe and Zuberbühler, 2007). For 
instance, more than 50 years of so-called ape language research has dem- 
onstrated that apes can acquire and use augmentative or alternative com- 
munication systems for interspecies communication (Lyn, 2012, in press). 
When considering their natural communication, there is recent evidence 
of intentional vocal communication in both captive and wild chimpanzees 
(Crockford et al., 2012; Hopkins et al., 2007a; Hopkins et al., 2011; Leav- 
ens et al., 2014a; Schel et al., 2013) suggesting that they have voluntary 
control of their vocalizations, findings that challenge many historical and
contemporary views of primate vocalizations (Premack, 2004; Seyfarth and
Cheney, 2010). There is also evidence that nonhuman primates, and par- 
ticularly great apes, produce intentional, referential gestures, sometimes in 
sequences, during inter- and intra-specific communicative events (Cartmill 
and Byrne, 2007; Gentry and Byrne, 2010; Hobaiter and Byrne, 2011a; 
Hobaiter and Byrne, 2011b; Leavens et al., 1996; Leavens et al., 2004b;
Leavens et al., 2015; Leavens et al., 2005a; Liebal et al., 2004; Pika et al., 
2003; Pika and Mitani, 2006; Pollick and de Waal, 2007), which some 
believe supports a gestural origins view of language evolution (Arbib et al., 
2008; Corballis, 2003; Corballis et al., 2012). Interestingly, studies in cap- 
tive and to a lesser degree wild apes as well as baboons have demonstrated 
consistent evidence of population-level right handedness during intra- and 
inter-specific gestures, suggesting a left lateralized system for gestural com- 
munication (Hobaiter and Byrne, 2013; Hopkins et al., 2012a; Hopkins et 
al., 2005; Meguerditchian and Vauclair, 2006; Prieur et al., 2016a, 2016b). 

The purpose of this paper is two-fold. First, a summary of the available 
literature on the morphology and cellular organization of Broca’s area in 
primates with a particular emphasis on data from chimpanzees is presented 
in the first part of the chapter. Second, I present some new descriptive data 
on different dimensions of the sulci comprising Broca’s area in chimpan- 
zees as well as their association with communicative, cognitive and motor 
functions. I emphasize throughout the paper what information seems well 
established and what limitations exist within the literature. 


2. Morphology and cellular organization of Broca’s area in 
primates 


2.1 Broca’s Area: Morphology 


The morphology of Broca’s area in the human and chimpanzee brain has 
been eloquently described in a series of papers by Keller and colleagues (Kel- 
ler et al., 2012; Keller et al., 2007; Keller et al., 2009). In the human brain, 
morphologically, Broca’s area is comprised of two primary regions, the 
Pars opercularis (ParsO) and Pars triangularis (ParsT) (see Figure 1a). The 
ParsO is defined posteriorly by the precentral inferior sulcus, superiorly
by the inferior frontal sulcus and anteriorly by the ascending ramus. The
ParsT is the gyrus that lies between the horizontal sulcus (or ramus) and the
anterior ramus. Connecting the anterior terminal points of the horizontal
and ascending rami closes the formation of the “triangularis” gyrus. In hu- 
mans, within the ParsO, there is also the presence of a small, shallow fold 
in some brains called the dimple. Though the volumes of the Pars opercularis
and Pars triangularis have historically been considered larger
in the left compared to right hemisphere, according to Keller et al. (2007) 
and others (Knaus et al., 2007; Tomaiuolo et al., 1999), the evidence is not 
as compelling as once thought. For instance, Keller et al. (2007) reported 
that a left hemisphere asymmetry in the gray matter volume of the Pars 
opercularis was contingent upon the presence of a dimple (a small fold) 
within the gyrus. In those brains that lacked the dimple, no interhemispheric 
differences in gray matter volume were evident. 

In great apes, Broca’s area is less well developed but there is some 
homology with respect to the ParsO in humans (Keller et al., 2009) (see 
Figure 1b). Specifically, in chimpanzees, bonobos, gorillas and orangu- 
tans, the ParsO can be anatomically defined using essentially the same 
landmarks as used in human brains. The posterior border is the precentral 
inferior (PCI) sulcus while the superior border is the inferior frontal sulcus 
(IFS). The anterior border is the fronto-orbital (FO) sulcus, which is the 
same sulcus as the anterior ramus in the human brain. Thus, the same 
sulci can be used to define the ParsO. In contrast, the great ape brains all 
lack a horizontal ramus which therefore prevents the measurement of a 
homolog to ParsT. 

In more distantly related Old and New World monkeys, from the stand- 
point of morphology, there are no common sulci landmarks with either 
chimpanzees or humans that can be used to define either ParsO or ParsT. 
To be clear, there are sulcal landmarks within the inferior portion of the
frontal lobe in monkeys that can be quantified and arguably could be used
as proxy measures for Broca’s area. For example, the arcuate and principalis
are two sulci that, judging by their spatial location, appear to be
analogous to the inferior frontal sulcus and precentral inferior sulcus
(Figure 1c). The challenge in defining a morphological homolog to either 
the ParsO or ParsT is the absence of a sulcus that can serve as an anterior 
border for either region.


Figure 1: 3D rendering of (a) human brain (b) chimpanzee brain and (c) rhesus
monkey brain with the sulcal landmarks used to define the inferior
frontal gyrus or ventrolateral prefrontal region.


2.2 Broca’s Area: Cytoarchitectonics


Though there has been tremendous historical and contemporary interest in 
the cellular organization and morphology of Broca’s area (Zilles and Amunts, 
2010), there are remarkably few studies on the cellular organization of this 
region in human and nonhuman primates (Schenker et al., 2007). Brodmann 
(1909) was the first to systematically describe the cytoarchitectonic regions 
of Broca’s area subsequently labeled Area 44 and Area 45. In the functional
neuroimaging literature, Area 44 and Area 45 are often used or considered
synonymous with the ParsO and ParsT but the relationship between the cellular
and morphological definition of Broca’s area is not perfect (Amunts et
al., 2007; Zilles and Amunts, 2010). In humans, generally, Area 44 cells are
found with greater probability within the ParsO while Area 45 cells are found
within ParsT (Amunts et al., 1999; Uylings et al., 2006). For chimpanzees,
as in humans, Area 44 cells are more consistently
found within the ParsO whereas Area 45 cells are found with greatest probability
in the gyrus and cortical fold immediately anterior and superior to the frontal 
orbital sulcus (see Figure 2) (Schenker et al., 2010). When considering the 
volume and number of neurons comprising Area44 and Area45 (see Table 1), 
there are two notable differences between humans and chimpanzees. First, 
humans tend to show a much more consistent leftward asymmetry compared 
to the chimpanzees. Second, as noted by Schenker et al. (2010), there has been 
a 6- to 7-fold volumetric expansion in Area 44 and Area 45 in humans
compared to chimpanzees. 


Figure 2: 3D rendering of a chimpanzee brain with probabilistic distribution of 
Area 44 and Area 45 cells projected onto the surface of the cortex (see 
Schenker et al., 2010). 


There are two additional sets of data and observations regarding Area 44 
and Area 45 in the human and chimpanzee brain. First, Schenker et al. 
(2010) created probabilistic maps of Area 44 and Area 45 in their chim- 
panzee sample by registering the cellular data to ex vivo MRI scans in the 
same stereotaxic space. Then they computed the percentage of overlap in 
the location of Area 44 and Area 45 voxels that were present in at least
50% of the chimpanzees. For Area 44, the volumes of the left and right
hemisphere probabilistic maps are 642 mm³ and 349 mm³, respectively. For
Area 45, the volumes of the left and right hemisphere probabilistic maps
are 280 mm³ and 249 mm³, respectively. The average native space left and
right volumes for Area 44 and Area 45 were 601 mm³, 648 mm³, 501
mm³ and 633 mm³, respectively. Thus, for Area 44, the average native left
hemisphere volume was very similar to the probabilistic volume (601 mm³
vs 642 mm³) whereas for the right hemisphere, the probabilistic volume
(349 mm³) was nearly half that of the native space (648 mm³). For Area
45, for both the left and right hemisphere there was a significant reduction
in volume between the native and probabilistic volume but it was much
greater for the right (633 mm³ vs 249 mm³) compared to left (501 mm³ vs
280 mm³) hemisphere. One interpretation of these results is that the spatial
location of Area 44 and Area 45 cells is much more consistent across
subjects in the left compared to right hemisphere, particularly for Area 44.
Second, in a more recent paper, Spocter et al. (2012) examined neuropil 
space in several brain regions in humans and chimpanzees including Area 
45 (see Table 1). The proportion of neuropil within a region of the cerebral 
cortex represents a key aspect of neuroanatomical organization, indicating 
its functional capacity. In a Nissl stain, the unstained portion of the tissue 
comprises the space occupied by synapses, dendrites, and blood vessels. A 
simple calculation of the neuropil fraction (NF) of tissue can be obtained 
by converting a high-resolution image of Nissl-stained sections to binary 
and calculating the ratio of the tissue compartment that is stained (i.e., cell 
bodies of neurons, glia, and endothelial cells) versus unstained. For a ma- 
jority of the regions, including Area 45, humans have a smaller NF value 
compared to the chimpanzees suggesting a higher proportion of synapses 
in this region. 
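
As a rough illustration of the calculation just described, the following Python sketch binarizes a grayscale image and reports the stained and unstained proportions of its pixels. The function name, the fixed threshold and the toy pixel values are assumptions made for this example; they are not taken from Spocter et al. (2012), whose image-processing steps are not detailed here.

```python
def tissue_fractions(image, threshold):
    """Binarize a grayscale image and return stained/unstained proportions.

    image: list of rows of pixel intensities (0 = darkest, 255 = brightest).
    threshold: hypothetical cutoff; pixels darker than this count as stained
               (cell bodies), the rest as unstained (neuropil space).
    """
    pixels = [value for row in image for value in row]
    if not pixels:
        raise ValueError("image contains no pixels")
    stained = sum(1 for value in pixels if value < threshold)
    stained_fraction = stained / len(pixels)
    return {"stained": stained_fraction, "unstained": 1.0 - stained_fraction}


# Toy 2 x 4 "section": three dark (stained) pixels out of eight.
toy_section = [
    [10, 200, 220, 30],
    [240, 250, 20, 210],
]
fractions = tissue_fractions(toy_section, threshold=128)
print(fractions)
```

In practice the threshold would have to be chosen per image (for example from the intensity histogram), and the result depends strongly on staining quality and section thickness; the fixed cutoff here is only a placeholder.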

Unfortunately, there are no available data on the volume, neuron number 
or cell density for Area 44 and Area 45 in other great apes. For macaque 
monkeys, Petrides and colleagues (Petrides et al., 2005; Petrides and Pan- 
dya, 1999, 2009) have done the most extensive analyses and these authors 
have identified two distinct Area 45 (a & b) regions within the gray matter 
comprising the anterior fold of the arcuate sulcus (see Figure 1c). Specifically,
Area 44 cells are found in the caudal bank and Area 45 cells are found
in the rostral bank of the inferior portion of the arcuate sulcus. Interestingly,
electrical stimulation of the BA44 region elicits oro-facial and finger movements
in macaque monkeys (Petrides et al., 2005).


2.3 Sulci Surface Area, Depth and Gray Matter Thickness 


Figure 3: Pipeline process for sulci extraction in BrainVisa. a. T1-scan;
b. inhomogeneity correction; c. mask of tissue from skull; d. split brain
mask and segmentation of the cerebellum; e. gray and white matter
delineation; f. white matter mould; g. gray matter mould.


As noted above, the Pars opercularis of the chimpanzee brain is defined by
three sulci including the fronto-orbital, precentral inferior and inferior frontal.
Rather than focus on the volume, surface area and cortical thickness of
the entire gyrus, it is also possible to quantify the surface area, mean depth 
and gray matter thickness of each of these folds using software programs 
such as BrainVisa (see Figure 3). BrainVisa (BV) is a software program that 
allows for extraction of the cortical folds or sulci and enables quantification 
of variability in their surface area, maximum and mean depth, length, gray 
matter thickness and fold opening (see Figure 4). We have previously used
BV to quantify individual and phylogenetic variability in the surface area
and mean depth of the central sulcus (Cykowski et al., 2008; Hopkins et al.,
2010; Hopkins et al., 2014; Hopkins et al., 2015a). Here, I present data
on the surface areas, mean depth and gray matter thickness for the three 
sulci used to define the Pars opercularis in the chimpanzee brain including 
the precentral inferior (PCI), inferior frontal (IFS) and fronto-orbital (FO) 
sulci. Specifically, my laboratory has recently quantified the surface area, 
mean depth and gray matter thickness of each sulcus in 271 in vivo and 
post-mortem chimpanzee MRI scans and the descriptive data for each sul- 
cus and measure are shown in Table 2. Chimpanzees showed a significant 
leftward asymmetry in the surface area for FO and a rightward asymmetry 
for PCI. In terms of mean depth, significant leftward asymmetries were 
evident for the FO and IFS. Lastly, a significant leftward asymmetry was 
found for gray matter thickness in the PCI sulcus. 


Figure 4: a. 3D rendering of chimpanzee brain with sulci boundaries of Broca’s 
area indicated in green, yellow and blue; b. extraction of example sulcus, 
in this case, the central sulcus and how the surface area, depth and gray 
matter thickness measures are derived. 


3. Behavioral Associations with Broca’s Area in 
Chimpanzees 


As noted above, Broca’s area is linked to a variety of praxic, linguistic, and 
cognitive functions in the human brain. In our laboratory, we have been 
particularly interested in the potential role that Broca’s area plays in praxic
and communication functions. Thus, in the following sections, I present an
overview of findings on the associations between praxic and communicative
functions and Broca’s area in chimpanzees. I also present some new data on
the relationship between sulcal variability within the inferior frontal gyrus and
communication and praxic skill in tool use actions.


3.1 Gestural and Vocal Communication 


Studies in our group have previously demonstrated that chimpanzees use 
manual gesture intentionally and referentially, and also show a prominent 
right-hand preference during both inter- and intra-specific interactions 
(Hopkins et al., 2005; Leavens et al., 1996; Leavens et al., 2004b). Addi- 
tionally, our group has previously reported that chimpanzees produce what 
we define as attention-getting (AG) sounds. As the name implies, AG sounds 
are produced to capture the attention of an otherwise inattentive audience. 
For instance, as their first communicative response, captive chimpanzees 
are more likely to produce an AG sound when a food is visible to them but 
a human experimenter is looking away from them compared to when the 
experimenter is looking at or facing them (Hostetter et al., 2001; Leavens et 
al., 2004c). This suggests that the apes are monitoring the visual orientation
of the human and can selectively choose to produce an AG sound when it 
is necessary to get the human’s visual attention before manually requesting 
a food, compared to situations in which they are sharing the same visual 
line of sight. Behaviorally, there are three other important findings on the 
use of AG sounds by chimpanzees including (1) they are often produced in 
conjunction with manual gestures (Hopkins and Cantero, 2003), (2) only
around 50% of our sample reliably produces these sounds and (3) there
is some evidence that they are heritable, possibly through social learning
(Taglialatela et al., 2012). 

With respect to the neural correlates of hand use for gestural communication
as well as differences between AG+ and AG- chimpanzees, several previous studies
have revealed significant associations. AG+ and AG- refer to those individuals
who were or were not classified as reliably producing attention-getting sounds,
respectively. For example, using a region-of-interest approach,
we have previously found that chimpanzees that are right-handed for pointing or
other manual gestures such as clapping show larger leftward asymmetries
in the volume of the IFG and, to a lesser extent, the planum temporale, compared
to non-right-handed individuals and these handedness effects were
not found in other brain regions (Hopkins and Nir, 2010; Meguerditchian et 
al., 2012; Taglialatela et al., 2006). More recently, Bianchi et al. (2016), using
voxel-based morphometry analyses, have reported that AG+ chimpanzees
have increased gray matter volume in the left ventrolateral premotor
cortex and rightward asymmetries in dorsal prefrontal cortex. In a related 
study but using a different approach, Hopkins et al. (2017a) compared the 
depth of the central sulcus along the dorsal-ventral plane between AG+ and 
AG- chimpanzees and found significant differences in asymmetries in the 
ventral but not dorsal portion of this fold, particularly among males. The 
ventral portion of the central sulcus (CS) corresponds to the motor regions 
that control oro-facial and laryngeal movements and lies below the elbow 
that defines the inferior border of the motor hand area of the precentral 
gyrus (BA6) (Bailey et al., 1950). 


3.2 Tool Use Handedness and Skill 


One theory on the evolution of language and speech pertains to the emer- 
gence of tool manufacture and use (Bradshaw and Rogers, 1993; Gibson 
and Ingold, 1993; Greenfield, 1991; Stout and Chaminade, 2012). Basi- 
cally, this theory proposes that human speech evolved by co-opting neural 
systems that were initially involved in praxic functions, notably the manufacture
and use of tools. In humans, there is clinical and experimental
evidence for shared neural systems underlying both praxic and verbal
functions including the inferior frontal gyrus, inferior parietal lobe and 
posterior and middle temporal gyri (Frey, 2008; Johnson-Frey, 2004; Lewis, 
2006; Roby-Brami et al., 2012; Stout and Chaminade, 2012). Further, some 
clinical studies have shown that both apraxia and aphasia can result when lesions
occur within the IFG or adjacent cortical regions, particularly in the left 
hemisphere (Kobayashi and Ugawa, 2013). 

Previous studies by our group have shown that individual variation in hand 
use and tool use skill is associated with different aspects of cortical organization
in the IFG in chimpanzees (Hopkins et al., 2012b). Our group has
characterized hand preference and performance asymmetries on a probing
tool use task that is designed to simulate termite fishing in wild chimpanzees
(Hopkins et al., 2009). Although population-level handedness for this
tool use task is not evident in our chimpanzee sample, we have found that 
right- and left-handed individuals differ in the linear length of the FO sulcus 
(Hopkins et al., 2007b). Our group has also found that performance asym- 
metries on the tool use tasks are associated with the volume of the IFG and 
the depth of the central sulcus, particularly in regions corresponding to the 
motor-hand area of the precentral gyrus or KNOB (Hopkins et al., 2017b). 
It is also important to note that in wild chimpanzees, there are several inter- 
esting observations regarding tool use as it may relate to brain asymmetries 
within the IFG and related areas. First, different types of tool use tasks seem 
to elicit strong individual hand preferences with a majority of individuals 
showing a clear bias (i.e., there are very few ambidextrous individuals), which
differs somewhat from data in captive chimpanzees (Hopkins et al., 2017b;
Hopkins et al., 2009). Second, though directional biases vary as a function of the
type of tool use task, population-level handedness is evident for several measures
such as ant-dipping, termite-fishing, pestle-pounding and wadge-dipping but
not nut-cracking (reviewed in Hopkins, 2013; Sanz et al., 2016). Thus, with
respect to the theories linking the evolution of praxic function and functional 
asymmetries, extant data on tool use clearly support this view. 


3.3 Some New Data and Findings 


As a means of further examining the influence of handedness for tool use, 
manual gestures and the use of AG sounds on variation in cortical folding, 
here I present some new findings on analyses of asymmetries in surface area, 
mean depth and gray matter thickness in the three sulci that define the inferior 
frontal gyrus of chimpanzees, notably FO, PCI and IFS. For each sulcus, we 
used the pipeline procedures in BrainVisa to extract the folds as we have done 
in previous studies (Hopkins et al., 2010; Hopkins et al., 2017a; Hopkins et 
al., 2014). The sulci were manually labeled and we subsequently obtained 
measures of their surface area, mean depth and gray matter thickness for the 
left and right hemispheres. Asymmetry quotients for each sulcus and measure
were derived following the formula [AQ = (R - L) / ((R + L) * 0.5)] where R and
L reflect the right and left hemisphere values. Positive AQ values indicate
rightward asymmetries and negative values indicate leftward asymmetries.
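
The AQ formula given above can be sketched in a few lines of Python. The function names and the example values are hypothetical and serve only to illustrate the sign convention; they are not measurements from the chimpanzee sample.

```python
def asymmetry_quotient(right, left):
    """Return AQ = (R - L) / ((R + L) * 0.5) for one sulcus and measure."""
    total = right + left
    if total == 0:
        raise ValueError("R + L must be non-zero")
    return (right - left) / (total * 0.5)


def asymmetry_direction(aq):
    """Negative AQ is leftward, positive AQ is rightward."""
    if aq < 0:
        return "leftward"
    if aq > 0:
        return "rightward"
    return "none"


# Hypothetical surface-area values (arbitrary units) for a single sulcus.
example_aq = asymmetry_quotient(right=90.0, left=110.0)
print(example_aq, asymmetry_direction(example_aq))
```

Because the difference is divided by the mean of the two hemispheres, the AQ is unitless and, for non-negative measures, bounded between -2 and +2, which allows surface area, depth and thickness asymmetries to be compared on a common scale.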


3.3.1 Manual Gesture and AG Sound Production


Figure 5: a. Picture of a chimpanzee gesturing to another individual; b. Mean AQ 
scores (+/- s.e.) for left-, ambiguously- and right-handed chimpanzees for 
surface area, mean depth and gray matter (GM) thickness. Negative AQ 
values indicate leftward asymmetries and positive indicate rightward 
biases.


For this analysis, I examined the influence of AG sound production and
handedness for manual gestures on AQ scores for each measure using a 
mixed model analysis of covariance. AQ scores for each sulcus were the re- 
peated measures while handedness (left-handed, right-handed, ambiguously- 
handed) and AG group (AG+, AG-) were the between group factors. Age 
(in years) was the covariate. For surface area, a main effect for gesture 
handedness was found (F(2, 238) = 3.06, p = .049; see Figure 5). Right- 
handed individuals showed increased leftward asymmetries compared to 
the ambiguously-handed and left-handed apes. No other significant main 
effects or interactions were found. For mean depth, no significant main
effects or interactions were found. For gray matter thickness, a significant 
main effect for handedness was found (F(2, 238)=3.07, p = .048) and, as 
was the case with the surface area measures, right-handed chimpanzees had 
greater leftward asymmetries than ambiguously-handed and left-handed in- 
dividuals (see Figure 5). A three-way interaction between sex, vocal group- 
ing and region was also found (F(2, 474) = 9.327, p = .001, Figure 6). 
Post-hoc analysis indicated that for FO, no significant differences were 
found between AG groups and sex. For IFS, female AG+ chimpanzees
had significantly greater leftward asymmetries than AG- females and AG+
males. In contrast, for the PCI, AG+ males had significantly greater leftward
asymmetries than AG- males. Thus, the association between AG grouping
and asymmetries in gray matter thickness was evident primarily in the IFS
for females and in the PCI for males.


Figure 6: Mean gray matter thickness AQ scores (+/- s.e.) for male and female
AG+ (vocal+) and AG- (vocal-) chimpanzees for the (a) FO, (b) IFS
and (c) PCI sulci. The 3D renderings adjacent to each graph depict the
sulci measured in each analysis. Negative AQ values indicate leftward
asymmetries and positive values indicate rightward biases.

3.3.2 Tool Use Skill


For these analyses, our group quantified asymmetries in motor skill on 
a tool use task designed to simulate ant- or termite-fishing in wild chim- 
panzees. Briefly, and as shown in Figure 7, food (usually mustard, honey 
or salsa) is placed inside an opaque PVC pipe that is attached to the 
outside mesh of their home cages. On the end of the PVC pipe facing the 
chimpanzees, there is a small opening (1 cm in diameter) into which the 
chimpanzees can insert a lollipop stick in order to obtain the food. The 
lollipop stick is slightly smaller than the hole (9 mm in diameter) thereby 
increasing the motor and spatial demands of inserting the stick into the
hole. In previous studies, we have recorded hand use and the latency from
the initiation of a probing response to the successful insertion of the stick 
on 50 trials in more than 200 chimpanzees on this task. In these previous 
studies, the chimpanzees were free to use whichever hand they preferred; 
however, for the analysis presented here, I focused on assessment of asym- 
metries in hand skill in which we controlled for the number of dipping 
responses for each hand. Specifically, in 90 chimpanzees, we measured 
the average latency to successfully insert the stick on 30 trials for each 
hand (see Phillips et al., 2013). Thus, to more fairly assess motor skill 
asymmetries, we obtained an equal number of dipping responses for each 
hand and calculated a difference score based on the mean latency scores
for each hand. Based on the sign of the difference score, chimpanzees 
were classified as performing better with their right or left hand. We then 
compared the AQ scores between the two handedness groups (based on 
their asymmetries in performance) using a mixed model ANCOVA which 
included vocal group (AG+, AG-) and sex (male, female) as additional 
between group factors and age as a covariate. 
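
The classification step described above can be sketched in Python as follows. The direction of subtraction (right minus left) and the handling of ties are assumptions made for illustration, since the text only states that the sign of the difference score was used; the latency values are hypothetical.

```python
from statistics import mean


def classify_hand_skill(left_latencies, right_latencies):
    """Classify an individual as performing better with the left or right hand.

    Assumption: difference score = mean right latency - mean left latency,
    and a shorter latency reflects better performance.
    """
    if len(left_latencies) != len(right_latencies):
        # The design calls for an equal number of responses per hand.
        raise ValueError("expected an equal number of trials for each hand")
    difference = mean(right_latencies) - mean(left_latencies)
    if difference < 0:
        better_hand = "right"
    elif difference > 0:
        better_hand = "left"
    else:
        better_hand = "tie"  # hypothetical category, not described in the text
    return difference, better_hand


# Hypothetical latencies in seconds (the study used 30 trials per hand).
score, hand = classify_hand_skill([4.0, 5.0, 6.0], [3.0, 4.0, 5.0])
```

Equalising the number of trials per hand keeps the two mean latencies comparable, which is the stated reason for controlling the number of dipping responses.
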

Figure 7: (a) Mean gray matter thickness AQ scores (+/- s.e.) for PCI in male and
female chimpanzees that perform better with their right or left hand
on the tool use task (illustrated in the left panel). (b) Mean gray matter
thickness AQ scores (+/- s.e.) for PCI in male and female AG+ (vocal+)
and AG- (vocal-) chimpanzees (behavior illustrated in left panel).
Negative AQ values indicate leftward asymmetries and positive values
indicate rightward biases.

For surface area and mean depth AQ scores, no significant main effects or
interactions were found. In contrast, for gray matter thickness, a
borderline significant three-way interaction was found between sex, sulcus
and performance asymmetry group (F(2, 162) = 2.61, p < .07) together, as we
found before, with a significant three-way interaction between sex, sulcus
and vocal group (F(2, 162) = 6.33, p < .003). To explore these interactions
more thoroughly, univariate analyses of covariance were performed on the
gray matter AQ scores for each sulcus, with vocal group, tool performance
group and sex serving as between-group factors and age as the covariate.
No significant main effects or interactions were found for FO and IFS
but, for PCI, significant two-way interactions were found between sex and
performance group (F(1, 82) = 4.586, p < .036) and sex and vocal grouping
(F(1, 82) = 8.747, p < .005). The mean gray matter thickness AQ scores
for PCI in males and females within the tool performance asymmetry and
vocal groups can be seen in Figure 7a & b. As can be seen, the pattern of
sex-dependent variation in AQ scores within PCI as a function of tool use
and AG sound production is similar.


4. Discussion 


From a purely gross anatomical standpoint, as has been articulated pre- 
viously, there are clear differences in the cortical folding patterns of the 
inferior frontal gyrus between humans and chimpanzees and, indeed, all 
great apes. The sulcal landmarks used to define the Pars opercularis appear 
to be highly conserved between humans and other apes. Moreover, when 
comparing patterns of asymmetry in total or gray matter volume within 
the IFG between humans and chimpanzees, the data seem fairly consistent
between species with neither showing a population-level bias. However, for 
the Pars triangularis, apes lack a horizontal ramus that serves as the inferior 
boundary for this anatomical region. Thus, there is some increased folding 
and gyrification within the ventral-lateral premotor and, indeed, the entire 
prefrontal cortex in humans compared to apes. From the cytoarchitectonic 
studies, the data also suggest that the volume of Area 44 and Area 45 in 
humans is between 6 and 7 times larger than the comparable regions in 
the chimpanzee brain (Schenker et al., 2010). In short, morphological and 
cellular changes in Broca’s area have evolved in humans after the split from 
the common ancestor with chimpanzees and this likely reflects increased 
cortical expansion and connectivity in response to selection for motor and 
cognitive demands associated with praxic and communicative functions. 
In chimpanzees, significant associations are found between motor and 
communicative functions and different aspects of cortical organization 
within the inferior frontal gyrus. With respect to communication, differ- 
ences are found between individuals who prefer to gesture with their right 
or left hand in asymmetries in (1) gray matter volume (2) surface area and 
gray matter thickness of the FO, PCI and IFS sulci and (3) depth of the 
central and middle portions of the central sulcus. Associations between 
handedness and asymmetries in these same regions are not evident for 
actions that are not communicative in function (i.e., simple reaching), the 


170 William D. Hopkins 


exception being tool use (see below). At face value, these results are consist- 
ent with the gestural origins theory of language evolution, at least at the 
neurological level of analysis. 

AG+ and AG- chimpanzees also differ in cortical organization within 
the IFG. Notably, Bianchi et al. (2016) found that AG+ chimpanzees had 
a small but significant difference in gray matter density within the left 
ventro-lateral prefrontal cortex and right dorsolateral prefrontal cortex 
compared to AG- apes. It is worth noting that the gray matter cluster within 
the left ventro-lateral prefrontal cortex reported by Bianchi et al. (2016) 
is located at the most medial point of the PCI. As reported here, AG+
chimpanzees, particularly males, show greater leftward asymmetries in gray
matter thickness within the PCI fold than AG- chimpanzees. Recall that PCI sulci define
the posterior border and include Area 44 cells (see Figure 2). Further, the 
gyrus immediately posterior to PCI is the ventral portion of the precentral 
gyrus. Previous studies by our group have found leftward asymmetries in 
the depth of the ventral portion of the central sulcus in AG+ but not AG- 
chimpanzees (Hopkins et al., 2017a). Thus, the emerging and convergent 
data strongly suggest that variability in cortical activation, cortical folding, 
gray matter thickness and volume is associated with AG sound production
(and perception) by chimpanzees and possibly more distantly related Old 
World monkeys (Coude et al., 2011; Petrides et al., 2005; Romanski et al., 
2004; Taglialatela et al., 2008, 2011). 

One interesting observation of the PCI sulci that has been discussed in the 
literature is the degree of variability in folding across subjects (Sherwood 
et al., 2003). Specifically, there is sometimes more than one PCI branch
that extends vertically from the IFS and this can influence measurement
of the volume of the IFS. In our labeling of PCI from the BrainVisa extrac- 
tions, if PCI bifurcated, we included all these additional folds in the analysis. 
However, it would be interesting to assess whether the complexity of folding 
for PCI is associated with any of the communicative or praxic functions 
we have quantified in the chimpanzees. In humans, for some sulci such 
as the anterior cingulate, variability in sulcal patterns is predictive of the
localization of individual motor functions (Amiez and Petrides, 2014) and 
this type of approach could potentially provide some additional and novel 
findings that are not captured with our existing methods. 

As with AG sound production, previous studies have found that chimpanzees
that perform better on a tool use task with their right hand show greater
leftward asymmetries in the IFG than those that perform better
with their left hand, particularly among males (Hopkins et al., 2017b).
Interestingly, hand preferences did not account for a significant proportion
of variance in IFG asymmetry; thus, asymmetries in manual skill, rather than
hand preference, were more strongly linked to variation in IFG asymmetry.
Analysis of the surface area, mean depth and gray matter (GM) thickness data
for the FO, IFS and PCI reported here further reinforces these findings. When
controlling for the number of right and left hand responses, we found that 
males who perform better with their right compared to left hand showed 
greater leftward asymmetries in PCI. This finding is quite consistent with 
the findings for AG sound production. 

In my view, what links the neuroanatomical findings from gesture, tool 
use skill and AG sound production is that all these actions require voluntary 
motor control and planning. As noted earlier, behavioral studies clearly 
show that the use of manual gestures and AG sounds is intentional and
referential and, in many ways, one can consider the function and use of AG 
sounds as a “gesture”; indeed, the use of AG sounds as gestures is context 
and modality specific but functions similarly to gestures (i.e., to draw the 
attention of a human experimenter to an object). It should be noted that AG 
sound production and gestures often co-occur, further suggesting that com- 
mon neural systems likely underlie their expression (Hopkins and Cantero, 
2003). The tool use testing we have done in our chimpanzees also requires 
hand-eye coordination and planned actions given the temporal and spatial 
constraints of the task; thus, this task also requires complex planning and 
execution of multiple motor systems. In addition, tool use, gesture and AG 
sound production share a common cognitive foundation in the form of 
means-ends reasoning (Hopkins et al., 2012a; Leavens et al., 2005b). For 
example, for tool use in wild chimpanzees, many forms are exhibited in
the context of obtaining food that is otherwise unavailable (e.g., ants in a
termite mound, meat inside a nut, water inside the trunk of a tree, etc.).
In captivity, most chimpanzees produce gestures or AG sounds to solicit 
the behavior of a human that has access to a food they see but is other- 
wise unavailable. Thus, in the captive setting, the chimpanzees are using 
the human experimenter effectively as a tool (i.e., social tool use) and, if a 
physical tool were available, they would likely use it instead of the human
(see Volter et al., in press for discussion). 

One important question that arises from the findings on the association
between Broca’s area and communication and tool use skills is the extent to
which common biological or genetic mechanisms account for unique or 
shared contributions to these brain-behavior associations. Specifically, our 
group has previously reported that manual gestures, AG sound produc- 
tion and tool use hand preference and skill are all heritable in chimpanzees 
(Hopkins, 2013; Hopkins et al., 2013; Hopkins et al., 2015b; Taglialatela 
et al., 2012). Similarly, we have found that gray matter volume of the IFG 
and planum temporale are significantly heritable in chimpanzees (Hopkins 
et al., 2015c). Whether common genes underlie both gray matter variation 
in the IFG and either tool use, manual gestures or AG sound production 
remains unclear but should be tested in future studies. 

Another important convergent set of results in our chimpanzee sample
concerns the sex-dependent brain-behavior associations, particularly
as they relate to asymmetry. Males appear to show more consistent
and robust brain-behavior associations compared to females. This is the
case for the association between asymmetries in the PCI and AG sound
production and tool use performance asymmetries (Figures 7a & 7b). The
stronger brain-behavior associations in males were also reported in relation 
to depth in the central sulcus and AG sound production (Hopkins et al., 
2017a). Why this is the case is not clear but we have previously found that 
variation in corpus callosum morphology and fiber count differ between 
male and female chimpanzees in relation to neuroanatomical asymmetries 
in the planum temporale (Hopkins et al., 2016). Therefore, one possible
explanation may be that males, as a whole, show more pronounced
asymmetries and these align themselves with individual differences in
behavior more so than in females.

Though Broca and many early investigators focused on the role of the 
inferior frontal gyrus in language functions, it is now clear that this region 
plays a role in a variety of cognitive and motor functions outside of the 
domain of language. For instance, the IFG is one brain region constituting 
the mirror neuron system, which has been implicated in the perception and 
production of imitation and related action-perception processes (Fazio et al., 
2009; Kilner et al., 2009; Makuuchi, 2005). Further, there is evidence that 
the IFG, particularly within the right hemisphere, plays an important role
in self- and cognitive control (Kawashima et al., 1996; Konishi et al.,
1999; Miller, 2000). There is also good evidence that apes exhibit imitation
recognition and can be taught basic “do-as-I-do” imitation tasks (Custance
et al., 1995; Haun and Call, 2008; Nielsen et al., 2005; Pope et al., 2015) 
and recent findings using DTI suggest important differences in connectivity 
between Broca’s area and the parietal and temporal cortex may underlie 
phylogenetic differences in social learning, including imitation (Hecht et al., 
2013a; Hecht et al., 2013b). Similarly, a recent paper by Hecht et al. (2017) 
found that individual differences in cortical connectivity within Broca’s area
were associated with individual differences in mirror self-recognition in
chimpanzees. Thus, the extent to which the IFG plays a role in cognitive
functions other than tool use and communication warrants further investigation.

Finally, with respect to language, Broca took a localization or phrenol- 
ogy view with the notion that a single brain region played a specific and 
necessary role in a given function. We now know that a number of cortical 
and subcortical brain regions, in addition to the IFG, represent a circuit of
connected areas that subserve language and speech as well as praxic func- 
tions such as tool use and tool making (Belton et al., 2003; Enard, 2011; 
Frey, 2008; Lewis, 2006; Lieberman, 2007; Stout and Chaminade, 2012; 
Vargha-Khadem et al., 2005). Here, we focused primarily on different as- 
pects of cortical organization within the IFG but clearly future studies need 
to consider additional brain regions either alone or in conjunction with the 
IFG to gain a fuller understanding of the morphological correlates of tool 
use and communication skills in primates, including humans. 

Table 1: Descriptive Data on Area 44 and Area 45 in Humans and Chimpanzees

Measure             Females Left

Volume
  Humans
    Area 44         5024 (1661)    3281 (1162)    2909 (652)     2036 (634)
    Area 45         3155 (653)     4137 (1238)    2859 (341)     2218 (225)
  Chimpanzees
    Area 44         625 (115)      544 (63)       577 (34)       752 (70)
    Area 45         519 (76)       601 (77)       483 (88)       666 (96)

Neuron Number
  Humans
    Area 44         95.4 (19.1)    64.0 (13.2)    67.2 (12.3)    47.8 (13.4)
    Area 45         68.2 (17.7)    79.0 (15.9)    74.6 (10.0)    54.8 (6.9)
  Chimpanzees
    Area 44         9.98 (1.29)    11.40 (1.27)   9.88 (2.30)    8.55 (1.99)
    Area 45         8.55 (1.59)    9.93 (0.91)    8.93 (1.47)    9.35 (0.73)

Neuropil Space
  Human
    Area 45         .278 (.020)    .267 (.016)    .316 (.020)    .328 (.016)
  Chimpanzee
    Area 45         .215 (.022)    .179 (.017)    .228 (.019)    .226 (.015)

Volume measures are in mm³. Neuron number (× 10⁶).


From Animal Communication to Linguistics
and Back: Insight from Combinatorial 
Abilities in Monkeys and Birds 


Abstract: For several decades, ethologists and comparative psychologists have been 
using a linguistic terminology to discuss complex communicative abilities in animals, 
with a particular focus on sound combinatorial rules. One historical example is 
the possible syntactic ability of songbirds. More recently, context-dependent call 
combinations have been described in nonhuman primates. This time, the detailed 
observational and experimental data gathered in this area has even drawn the atten- 
tion of linguists and has given rise to studies highlighting the relevance of linguistic 
tools for the study of nonhuman primate communication systems. However, the 
parallels that can be drawn between humans’, birds’ and nonhuman primates’ 
verbal/vocal combinations still remain the topic of intense debate possibly because 
mismatches between the terminologies used have confounding effects. The question 
is: can we go beyond the traditional dichotomy between phonological and lexical 
syntax to characterize the diversity of sound combinatorial rules found in animals? 
Here, we will adopt a two-step approach in order to discuss: (1) what forms sound 
combination takes in animals, based on structural and functional criteria and when 
it may or may not be appropriate to use linguistic terms; (2) why sound combination
may have evolved in some species more than others. We will notably illustrate our 
arguments with recent findings in some cooperative breeding birds and guenons, 
where cases of meaningful sound compositionality have been recently described. 


Keywords: sound combinations, verbal combinations, animal communication


1. Introduction 


The communicative abilities of animals have received considerable research 
interest over the years. One likely contributing factor is that animal com- 
munication constitutes a particularly fruitful substrate for comparative 
analyses with human language. Language, in its full-blown form, is a
product of a variety of communicative and socio-cognitive capacities. This
complexity, together with its relatively short window of emergence, has
led to the suggestion that, as with other complex biological phenomena,
language might have evolved from pre-existing capacities and structures
initially serving other functions. One way to shed light on the potential
evolutionary path leading to the emergence of language is to decompose
it into several core features and to explore their presence and role in
non-human animals from various taxa. This approach, using animals from
distinct and phylogenetically distant taxa, is particularly relevant in helping
disentangle the relative influence of various factors involved in the
evolution of complex communicative abilities culminating in human language.
One relevant historical example is that of capacities for vocal learning,
which have been described in some birds (e.g. parrots, starlings and
mockingbirds) and mammalian species (e.g. some bats and marine mammals)
in which individuals are able to acquire new vocalisations. Several
studies provide convincing evidence for the importance of social learning and
auditory feedback for call and song acquisition, which are central to
language acquisition in humans. In addition, some studies even reported
the cultural transmission of vocal dialects in several bird and mammal
species. Given the distribution of vocal learning species in distant
taxa, these capacities are assumed to result from convergent evolution.
Moreover, the study of the neural substrates and genetics of species that
possess vocal learning capacities, compared with humans and with
non-learning species (e.g. gull, dove, suboscines), has generated hypotheses
for the emergence of vocal learning. For instance, comparative studies of
the structure and expression patterns of the now famous FoxP2 gene have
shed light on its potential role in the emergence of vocal learning in the
human lineage: using data from comparative studies and clinical studies
in humans, some authors proposed that ancient neural functions of FoxP2
have been co-opted to subserve aspects of vocal communication, and
notably vocal learning, in several species including humans. This example
demonstrates the utility of broad comparative studies to clarify particular
aspects of the evolution of communication. However, the study of FOXP2
in isolation is not sufficient and other factors are likely involved and have
to be explored regarding vocal learning in animals. For instance, some
so-called ‘non-learners’ have recently been shown to display significant
abilities in vocal plasticity (e.g. elephant, goat, marmoset, guenons
[29-31], gibbons), questioning the relevance of the traditional learner/
non-learner dichotomy.

Several other capacities involved in language have been studied and 
described to various extents in non-human species, including many non- 
human primates. For instance, various species display the capacity to 
produce intentional signals or semantic-like signals (also termed
referential signals, i.e. signals that refer to an external object of the world).
Also, vocal exchanges in some species are strictly organised and display
“conversation-like” properties (i.e. based on call overlap avoidance and
turn-taking between exchanging partners). Furthermore, the capacity
to combine sounds into complex structures has long been a topic of much
contention. Indeed, combinations of vocal units have been extensively
reported in animals from various taxa, including numerous bird species
(e.g. winter wren (Troglodytes troglodytes), Bengalese finches (Lonchura
striata), mockingbirds, European starlings (Sturnus vulgaris), several
species of chickadees, and blue-throated hummingbirds (Lampornis
clemenciae)) as well as mammals such as rock hyraxes (Procavia
capensis), several species of bats (i.e. mustached bats (Pteronotus parnellii);
free-tailed bats (Tadarida brasiliensis); sac-winged bats (Saccopteryx
bilineata)), whales (humpback whales (Megaptera novaeangliae);
killer whales (Orcinus orca); pilot whales (Globicephala sp.); sperm
whales (Physeter macrocephalus)) and non-human primates (e.g.
cotton-top tamarins (Saguinus oedipus); gorillas (Gorilla sp.); red-bellied
titi monkeys (Callicebus moloch)). Sound combinations occur in a wide
diversity of contexts, such as alarm contexts, socio-positive
interactions, mate attraction or territorial defence, and can take various
forms. Some animals, for example, can merge acoustic units (i.e. basic
elements consisting of a continuous mark on a sonogram, also termed notes in
birds) into combined calls (e.g. consisting of several units merged linearly
with little to no silence between them, also termed motifs in birds).
Furthermore, calls (simple and/or combined ones) can then also be combined
into higher-order call sequences (i.e. series of calls uttered in sequence and
separated by a silent interval always shorter than silent gaps between
sequences, also termed songs in birds).
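
The hierarchy of units, calls and sequences described above can be sketched as a simple grouping by silent intervals. This is an illustrative Python sketch only: the gap thresholds and the onset and offset times are hypothetical, and in practice such criteria are species-specific and set by the researcher.

```python
def group_by_gap(intervals, max_gap):
    """Group (onset, offset) intervals, sorted by onset, into runs in which
    consecutive intervals are separated by at most max_gap seconds of silence."""
    groups = []
    for onset, offset in intervals:
        if groups and onset - groups[-1][-1][1] <= max_gap:
            groups[-1].append((onset, offset))
        else:
            groups.append([(onset, offset)])
    return groups


def span(group):
    """Collapse a group of intervals into a single (onset, offset) interval."""
    return (group[0][0], group[-1][1])


# Hypothetical acoustic units measured on a sonogram, in seconds.
units = [(0.00, 0.10), (0.12, 0.20), (0.60, 0.70), (3.00, 3.10), (3.12, 3.25)]

# Units with little to no silence between them are merged into calls.
calls = [span(g) for g in group_by_gap(units, max_gap=0.05)]

# Calls separated by gaps shorter than the between-sequence gap form a sequence.
sequences = group_by_gap(calls, max_gap=1.0)
```

With these hypothetical values, five units yield three calls, two of which are combined calls, and the calls fall into two sequences.
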

Combinatoriality is central to language, and drawing parallels between
sound combinations in animals and the combinatorial systems of language
is tempting. Language relies on the combination of sounds (phonemes, see
Table 1 for a definition) into larger units (morphemes and words) which
are themselves combined into larger utterances (sentences). However,
useful comparisons are hard to achieve partly because, contrary to most
ethological definitions given for sound combinations, linguistic definitions
often rely heavily on functional aspects and include an element’s meaning or
grammatical function as a way to characterise it.

Language’s generativity (i.e. capacity to generate an infinite number of
ends using finite means) is a product of dual articulation, which allows
combination at two distinct layers: phonology and morphosyntax
(Table 1). Phonology corresponds to the combination of meaningless sounds
(i.e. phonemes) into meaningful elements (i.e. morphemes and
monomorphemic words). Simply put, a phoneme is a sound which, when added,
deleted or used to replace another sound in a word, creates a phonemic
contrast changing the meaning of the word. For example, in English the
sounds /k/ and /b/ are phonemes as they differentiate the words ‘cat’ and
‘bat’. Two words that differ by only one phoneme are termed a ‘minimal pair’.
Morphosyntax (Table 1) corresponds to the second layer of combination,
in which meaningful elements (morphemes and words) are combined into
larger structures whose meaning depends on the elements composing them
and their order. Some words consist of only one morpheme (i.e.
monomorphemic words such as ‘happy’) but morphemes can also be combined
together into polymorphemic words. For example, the word ‘happy’ can
be combined with the suffix ‘ness’ to create the polymorphemic word
‘happiness’ or with the prefix ‘un’ to create ‘unhappy’. Finally, in phrases,
morphemes are combined according to grammatical rules. These rules
are a key contributor to language’s generativity: with a finite number of
rules it is possible to generate, using a finite number of elements, an
infinite number of structures, among which rules distinguish well-formed (or
grammatical) syntactic structures from ill-formed (or non-grammatical)
syntactic structures.
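
The notion of a minimal pair can be made concrete with a short Python sketch. It assumes that words are already transcribed as lists of phoneme symbols (the transcriptions below are simplified and illustrative), and it covers only the substitution case, not contrasts created by adding or deleting a sound.

```python
def is_minimal_pair(word_a, word_b):
    """Return True if two phoneme sequences differ in exactly one position.

    Assumption: each word is a list of phoneme symbols of equal length,
    so only contrasts created by replacing one sound are detected.
    """
    if len(word_a) != len(word_b):
        return False
    mismatches = sum(1 for a, b in zip(word_a, word_b) if a != b)
    return mismatches == 1


# Simplified, illustrative transcriptions of the English words 'cat' and 'bat'.
cat = ["k", "æ", "t"]
bat = ["b", "æ", "t"]
cap = ["k", "æ", "p"]

cat_bat = is_minimal_pair(cat, bat)   # /k/ versus /b/ in first position
bat_cap = is_minimal_pair(bat, cap)   # two positions differ
```

A check of this kind establishes only a formal contrast; showing that the contrast changes meaning, as required for a phoneme, needs separate evidence on how the two forms are used.
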

Table 1: Main linguistic concepts and definitions to be used in this chapter

Phoneme             Smallest meaning-differentiating sound in a language,
                    i.e. a meaningless sound that allows two words to be
                    differentiated.

Morpheme            A minimal unit of meaning or grammatical function.

Phonology           Combinatorial layer of language in which phonemes are
                    combined to form morphemes and words.

Morphosyntax        Combinatorial layer of language in which meaningful
                    elements are combined into larger structures whose
                    meaning depends on the elements composing them and
                    their order. Morphosyntax includes both morphology,
                    where morphemes can be combined into more complex
                    structures (i.e. polymorphemic words), and syntax,
                    where mono- and polymorphemic words are combined into
                    sentences.

Dual articulation   Characteristic of language whereby speech can be
                    analysed at two complementary levels: phonology and
                    morphosyntax. Duality of patterning has been
                    characterised as a design feature of language which is
                    partly responsible for language’s virtually infinite
                    generativity.

Scalar              Linguistic concept related to pragmatic inference. The
implicatures        core idea is that the utterance of a sentence S
                    implicates the falsity of stronger alternatives (i.e.
                    more informative ones) as, for any stronger alternative
                    S’ to S, a cooperative speaker would have used S’
                    rather than S if s/he believed S’ to be true. For
                    example, the sentence S “some of my trees are oaks”
                    implies that not all my trees are oaks as, if all were,
                    I would have used the more informative sentence S’
                    “all of my trees are oaks”.


Peter Marler proposed to differentiate between animal combinatorial
structures depending on their organisation and the likely meaning of their
components. To this end, he used terms borrowed from linguistics and
distinguished two main types of organisation: phonological syntax (or
phonocoding) and lexical syntax (or lexicoding). He defined phonological
syntax as the concatenation of sounds without independent information
content and which are not used singularly, or meaningful sounds that lose
their original content when combined. He defined lexical syntax as the
level at which meaningful elements are combined. Whilst Marler borrowed
terms from linguistics, several important differences with human phonology
and morphosyntax remain. First, the concept of ‘meaning differentiation’
(i.e. change in meaning of the whole resulting from change in one of its
meaningless elements) is absent from Marler’s definition of phonological
syntax. In addition, contrary to phonology, Marler’s phonological syntax
allows the combination of meaningful units. Finally, the definition he
proposed for lexical syntax, though closer to morphosyntax than phonological
syntax is to phonology, differs strikingly from its ‘human’ counterpart, as
the importance of elements’ order for combined call meaning is overlooked,
while the order of words and morphemes is of central importance in most
human utterances.

Recently, the study of combinatorial systems in animals has received
renewed interest from researchers from various fields including ethology,
linguistics and psychology. A multitude of studies have illuminated
new perspectives and subsequently given rise to the development of
interdisciplinary work as well as prompting controlled experiments
investigating combinatorial structures found in the natural communication
of animals and their relevance to receivers. Within this framework,
some authors have questioned the relevance of the definitions proposed
by Marler, notably because the joint use of the terms ‘phonological’ and
‘syntax’, which correspond to very distinct linguistic concepts, is
misleading. We concur and propose that further reflection is now necessary in
order to develop an accurate terminology to characterise the structure and
functional aspects of sound combination in animals. In addition to
providing the groundwork facilitating understanding of animal combinatorial
systems, this will best serve comparative analyses with language.


This chapter therefore aims to review comparative work on sound combina- 
tions in humans and animals, with two intended outcomes: 


- To propose a basis for future interdisciplinary work aiming to develop a
  more appropriate terminology, and to shed light on some potentially
  fruitful prospects for future studies of sound combination.

- To elucidate evolutionarily relevant factors likely to have influenced the
  development of combinatorial communication systems.


We will first turn our attention to the variety of combinatorial
structures found in primates and birds. We will (1) examine possible bases
for defining rudimentary parallels with sound combination in language and
(2) review recent empirical studies providing convincing evidence for
combinatorial capacities in animals that parallel those of language. In the
third part of this chapter, we will focus on recent advances brought by the
use of formal linguistic analyses of animal communication systems.
Finally, in line with the comparative rationale adopted at the beginning
of this chapter, we will build on the taxonomic diversity of the examples
described to formulate hypotheses about why combinatorial systems emerge.


2. Combinatorial systems: diversity and terminology 


This section reviews various animal combinatorial systems and evaluates
the terminology used to characterise them. Given the key role played by
meaning in combinatorial systems, particularly when comparing them with
structures in human language, a clarification is in order. We will use
the term ‘meaning’ in a form approaching Gricean natural meaning, i.e. as
the significance or information that receivers derive from a signal and
its context (because of its regular association with a given event,
individual or object), without assuming any intention on the emitter’s
part to inform others.


2.1 Parallels with phonology? 


Many bird species rely on the combination of apparently meaningless units
into larger structures. However, studies describing such systems often
lack information on the contextual correlates of the combinatorial
variants emitted. Furthermore, experiments testing a potential intrinsic
meaning of single units (or changes in meaning accompanying changes in
the type or order of units combined) are often missing. Some experiments
in songbirds have shown that receivers’ reactions can be influenced by
unit diversity, by fine acoustic structure, or by the simultaneous
modification of several frequency and temporal parameters[104], suggesting
that information (about the caller’s quality or identity) is conveyed.
Nevertheless, as behavioural observations frequently suggest, such
modifications, as well as changes in unit type or order, do not seem to
alter the main function or “semantic content” of the sequence (i.e. mostly
social bonding, mate attraction and/or territorial defence in the case of
songbirds). As a result, any parallel with the phonological layer of
language is limited at best, and it has subsequently been argued that such
systems may be better described in terms of “phonetic patterning”, which
relates to the physical properties of sounds but does not characterise
sounds as meaning-differentiating.

Previous work on non-human primates, particularly gibbons, has suggested
possible additional parallels with the phonological organisation seen in
language. An observational study on white-handed gibbons (Hylobates lar),
for example, indicated that their communication, like that of songbirds,
relies on the combination of apparently meaningless units into sequences.
However, as far as we know, and in contrast to songbirds, gibbons give two
types of sequences that are associated with strikingly distinct contexts
and functions: one is produced routinely in the morning, while the other
signals the presence of a predator. In both contexts, these sequences are
given in duets during which two partners sing in a coordinated way, but
the organisation of sequences differs between morning duets and
predator-induced ones. More precisely, the two types of duets differ in
three ways: (1) in the proportion of one type of note (the ‘hoo’ note,
with on average 100 vs 10 ‘hoo’ notes introducing predator-induced songs
and morning duets respectively), (2) in the order of the motifs involved
(female-specific calls are given later and are answered more slowly by the
male partner in predatory contexts) and (3) in the presence of two note
types (i.e. ‘leaning wa’ notes are globally absent from predator-induced
songs, while ‘sharp wow’ notes are absent from morning duets). Moreover,
natural observations indicate that wild individuals react differently to
the two types of sequences, suggesting that the structuring of the signal
encodes information. Further experimental work is now required to clarify
how. Playback experiments comparing receivers’ reactions to natural
sequences and to artificial stimuli in which the order, proportion and
type of notes are manipulated will be necessary to identify what receivers
use to discriminate between sequence types. In addition, acoustic analyses
and playbacks could clarify whether the acoustic structure of notes varies
between contexts and whether notes possess an intrinsic meaning (notably
‘leaning wa’ and ‘sharp wow’ notes). Such information would help determine
the nature of the system (i.e. showing parallels with phonology,
morphology or neither) and may also shed light on the cognitive processes
underlying communication in this species.

Thus, although the studies on birds and primates reviewed above match
some of the criteria used to define phonological combination in language,
none of them does so fully, primarily because a demonstration that the
message changes with sequence organisation (i.e. meaning-differentiation)
was lacking or because the intrinsic meaning of notes was unclear. We
propose that convincing evidence for parallels with phonology in animals
would require: (1) a combination involving units that are not associated
with any particular behavioural context (and from which receivers
therefore could not individually extract specific information about the
environment or the caller’s behaviour), (2) that the combination (or
addition) of given “meaningless” units in a given order creates a signal
which can be reliably associated with one (or several) external events or
caller behaviours[96], and critically (3) that changes in unit order or
composition trigger changes in the signal’s content. Finally, to parallel
in a rudimentary way the productivity of language, we would also expect
such systems to involve the reuse of units across distinct types of
utterances.


2.2 Parallels with morphology? 


The second layer of language, morphosyntax, relies on the combination of
meaningful sounds into larger structures whose meaning depends on their
components and organisation. Several studies have described vocalisations
composed of apparently meaningful calls, but here again the parallel with
the morphosyntactic organisation of language is not always clear. A series
of studies investigating gorilla communication has described a potential
combinatorial system in a great ape species. Both mountain and western
gorillas possess a graded repertoire composed of five main types of close
calls. Each type of call can be given alone or combined with every other
close call unit in non-random ways. The authors analysed the contextual
correlates of emission for three types of units and their most common
combinations: atonal grunts (A1), short tonal grunts (T2) and grumbles
(T4), as well as A1-T4 and T2-T4 combinations. The results show that T4 is
given more in foraging contexts, in particular when no individual is
within 5 m of the emitter, whereas A1 and T2 are associated with resting
contexts, notably when other individuals are close to the emitter (i.e.
<5 m), and do not differ from each other in their context of emission.
A1-T4 and T2-T4 combinations are given in the same contexts as A1 and T2
calls but, in contrast to single calls, are strongly associated with vocal
exchanges. These results suggest that, in this system, T4 units, which may
serve as a ‘localisation’ call owing to their longer duration, can be added
to A1 or T2 units (whose ‘normal’ context of emission is thus respected)
during vocal exchanges. However, whether combination triggers changes in
the information content of the calls remains unknown, because receivers’
reactions to single and combined units have not been tested and, more
importantly, because the contextual correlates of the vast majority of the
combinations given by gorillas (more than 150 different types) have not
yet been investigated. In addition, the role of repetition and call order
in combined vocalisations, which seem to vary greatly, remains poorly
understood.

A series of studies on the alarm call system of male putty-nosed monkeys
also revealed an intriguing system which relies on the combination of
calls that appear to carry meaning. Male putty-nosed monkeys use two
distinct loud calls, ‘Pyow’ and ‘Hack’. A first series of studies using
natural observations, playbacks and predator presentation experiments
suggested that sequences of Pyows were regularly given to leopards, while
sequences of Hacks as well as transitional Hack series (i.e. several Hacks
followed by several Pyows) were common responses to crowned-hawk eagles.
Interestingly, Pyow-Hack sequences (i.e. 1-4 Pyows followed by 1-4 Hacks)
reliably trigger movement (both natural sequences and sequences
artificially composed of calls given in other contexts). The relationship
between the apparent meaning of Pyow-Hack sequences and that of their
components has raised questions, and four main interpretations have been
proposed: (1) a phonological interpretation in which Pyow and Hack would
work as ‘phonemes’, i.e. allowing differentiation between the meaning of
single units (Pyow, Hack) and that of their joint use (Pyow-Hack
sequence); (2) an idiomatic interpretation in which the original,
compositional meaning of the Pyow-Hack sequence has been blurred, as in
human idioms (e.g. ‘it’s raining cats and dogs’); and two more ‘semantic’
interpretations: (3) one in which Pyow and Hack would respectively carry
the meanings ‘move on the ground’ and ‘move in the air’, while the
Pyow-Hack sequence would carry a combined, general meaning ‘we move, let’s
go’, as putty-nosed monkeys occupy various strata at a time and can travel
on the ground as well as in the canopy; and (4) an interpretation based on
weak meanings for Pyow (i.e. underspecified, general alarm) and Hack
(i.e. non-ground movement or high arousal, depending on the analysis),
combined with inferences based on the pragmatic principles of competition
and on the influence of contextual cues. Further investigation of the
mental representations triggered by conspecific calls, and of putty-nosed
monkeys’ capacity to handle and understand combinatorial structures more
generally, is now necessary to determine which of these interpretations
(or others) is most plausible.

The examples reviewed above show that combining meaningful calls into
larger structures (either combined calls or call sequences) is not
sufficient to offer a robust parallel with the morphosyntactic
organisation of language. In particular, we argue that to be considered a
rudimentary parallel with morphosyntax, a system would need to (1) involve
the combination of vocal units, from which receivers can individually
extract information, into a larger structure. The information content of
that structure would also need to depend on, and reflect, (2) the units
merged together and their respective content and (3) rules for unit
combination (i.e. a systematic order of combination and a consistent
alteration of the information conveyed by the signal).


3. Focus on promising examples: the cases of babblers 
and guenons 


3.1 Parallels with phonology 


To date and to our knowledge, only one study, documenting note
combinations in chestnut-crowned babblers, has provided convincing
evidence for a parallel with the phonological layer of language. In this
study, Engesser et al. combined natural observations, acoustic analyses
and playbacks of natural and artificially recombined sounds. These
cooperatively breeding birds, which live in arid areas of south-eastern
Australia, possess a vocal repertoire of discrete calls, most of which are
composed of apparently meaningless notes. Critically, some notes are
reused across call types, such as the ‘A’ and ‘B’ notes, which can be
combined into an ‘AB’ structure during flight (i.e. flight call) and into
a ‘BAB’ structure during nestling provisioning (i.e. prompt call)
(Figure 1). Acoustic analyses showed no difference in the structure of
notes between call types.


Figure 1: Spectrogram of a double-element flight call (i.e. F1 F2) and a
triple-element prompt call (i.e. P1 P2 P3) of adult chestnut-crowned
babblers (time axis in seconds). Figure reproduced from Engesser et al.
(2015).


Receivers’ reactions did not differ between natural and artificial stimuli
(i.e. artificial flight calls created by deleting the first ‘B’ unit of a
prompt call and artificial prompt calls created by adding a ‘B’ unit to a
flight call) within a call type. In addition, the broadcast of single ‘B’
units and of artificial ‘CAB’ stimuli (‘C’ being a call element naturally
given in combination with other notes by chestnut-crowned babblers)
triggered surprised reactions that differed from those obtained by the
broadcast of flight or prompt calls. These additional testing conditions
thus ruled out a possible ‘priming effect’ of a ‘B’ element as well as
responses being driven by superstructure effects. Thus, the flight
call/prompt call complex in chestnut-crowned babblers seems to match the
three key criteria needed to draw a parallel with phonology, i.e. (1) a
combination of ‘meaningless’ elements into (2) a structure that is
meaningful to receivers and (3) whose meaning changes if the order or
presence of elements changes. Indeed, the authors argue that this example
represents a rudimentary form of phonemic contrast. Given that several
other calls in the repertoire of these birds form call pairs (i.e. two
calls given in distinct contexts that differ by only one note), future
studies will be needed to determine whether they also make use of a
similar combinatorial mechanism. Tackling this question would be of
particular importance, as it would also help shed light on the
productivity of the system (i.e. the extent to which notes are reused
across the repertoire to create various types of utterances).
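
As an illustration only, the three criteria and the reuse condition
discussed above can be written as a small checklist. The Python sketch
below encodes the babbler results in a deliberately simplified,
hypothetical form: the RESPONSES table, the labels "flight" and "prompt",
the use of None for stimuli that elicited no call-specific response, and
all function names are assumptions made for this example, not the data
format or the analysis used by Engesser et al.

```python
from collections import Counter
from itertools import combinations

# Hypothetical encoding of the playback outcomes summarised in the text:
# a note sequence maps to the call-specific response it elicited, or to
# None when no flight- or prompt-like response was observed.
RESPONSES = {
    ("A", "B"): "flight",       # flight call
    ("B", "A", "B"): "prompt",  # prompt call
    ("B",): None,               # single 'B' unit
    ("C", "A", "B"): None,      # artificial 'CAB' stimulus
}


def meaningful_calls(responses):
    """Keep only the sequences that elicited a call-specific response."""
    return {seq: r for seq, r in responses.items() if r is not None}


def single_units_meaningless(responses):
    """Criterion 1: the single units tested carry no response of their own."""
    return all(r is None for seq, r in responses.items() if len(seq) == 1)


def differ_by_one_note(a, b):
    """True if the longer sequence is the shorter one plus one extra note."""
    short, long_ = sorted((a, b), key=len)
    if len(long_) - len(short) != 1:
        return False
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def has_meaning_contrast(responses):
    """Criterion 3: adding or removing one note changes the response."""
    calls = meaningful_calls(responses)
    return any(
        differ_by_one_note(a, b) and calls[a] != calls[b]
        for a, b in combinations(calls, 2)
    )


def reused_notes(responses):
    """Productivity: notes that occur in at least two meaningful calls."""
    calls = meaningful_calls(responses)
    counts = Counter(note for seq in calls for note in set(seq))
    return {note for note, n in counts.items() if n >= 2}


def check_criteria(responses):
    """Collect the outcome of each check in one dictionary."""
    return {
        "units_meaningless": single_units_meaningless(responses),
        "combinations_meaningful": len(meaningful_calls(responses)) >= 2,
        "meaning_contrast": has_meaning_contrast(responses),
        "reused_notes": reused_notes(responses),
    }
```

Calling check_criteria(RESPONSES) on this toy table reports that all three
criteria hold and that the ‘A’ and ‘B’ notes are reused. The same
checklist could in principle be applied to other call pairs of the
repertoire once the corresponding playback data are available.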


3.2 Parallels with morphosyntax?


The combination of meaningful calls via a mechanism resembling
morphosyntax can involve the merging of sounds into combined calls as well
as the combination of sounds (separated by silent gaps) into call
sequences. We therefore propose to differentiate between combined calls
and call sequences. Such a distinction is advantageous for several
reasons. Firstly, it falls in line with the traditional distinction drawn
between words and phrases (and between calls and call sequences in
animals). Secondly, it may facilitate analyses, notably during preliminary
phases of investigation. Finally, it may also be more realistic, as
different underlying capacities, such as working memory, may be required
to produce and interpret combinations of meaningful elements at these two
levels in animals.

In addition, the combination of morphemes can involve two types of
elements: bound morphemes, which are always used in conjunction with
others (e.g. suffixes), and free morphemes, which constitute
monomorphemic words when used alone. Analogues of these two types of
combination in animals would be, respectively, the merging of one
individual call (that can be used alone) with a vocal unit that is never
given by itself, and the merging of two individual calls. Interestingly,
evidence for both types of call combination (i.e. using individual calls
or units that are never used alone) has recently been reported in one
cooperatively breeding bird (the southern pied babbler, Turdoides bicolor)
and in two species of guenons, Campbell’s and Diana monkeys (see Table 2
for a summary).


Southern pied babblers 


A recent study highlighted a combination mechanism in the alarm calls of
the southern pied babbler, a cooperatively breeding bird living in the
arid areas of South Africa. These birds emit an alert call with a
broadband structure in response to low-urgency threats and a tonal,
repetitive recruitment call in non-alarm contexts to attract group members
to a new location (e.g. for roosting or foraging). Critically, pied
babblers sequentially combine alert and recruitment calls when
encountering and mobbing terrestrial predators (Figure 2, Table 2).


Figure 2: Spectrogram of a mobbing sequence composed of one alert and seven
recruitment calls. Figure reproduced from Engesser et al. (2016).


Using a playback experiment, the authors tested the combinatorial
structure of the mobbing sequence and its relevance to receivers by
comparing the reactions of wild, but habituated, pied babblers to the
broadcast of natural and artificially created stimuli (i.e. mobbing
sequences created by combining alert and recruitment calls, and
single-call stimuli extracted from natural mobbing sequences). Subjects’
reactions to natural and artificial stimuli did not differ, which
demonstrated that mobbing sequences consist of the linear merging of alert
and recruitment calls and thus confirmed their combinatorial nature. In
addition, the distinct reactions given to the three call types presented
(i.e. alert call, recruitment call and mobbing sequence) demonstrated the
relevance of these calls to receivers. This study thus satisfies the three
criteria proposed for parallels with linguistic morphosyntax in a
non-human animal, i.e. individually meaningful calls combined into a
meaningful structure whose meaning reflects that of the elements involved.
Interestingly, receivers’ reaction to mobbing sequences exceeded the sum
of their reactions to the components (i.e. higher attentiveness and
quicker approach). This suggests that, in this case, the combination of
two elements did not simply lead to an addition of their meanings but
potentially gave rise to a ‘richer’ meaning (i.e. ‘mobbing a predator’)
that is related to, yet goes beyond, the meaning of its parts.


Diana monkeys 


Other studies, focussing on the communication of two cercopithecids, Diana
and Campbell’s monkeys, revealed meaningful combinatorial systems that
could offer rudimentary parallels with the morphosyntactic organisation of
language. These arboreal primates live sympatrically in the dense primary
forests of West Africa, and their communication, which relies almost
exclusively on sex-specific vocal signals, has been studied intensively
over the past decades. In Diana monkeys, females possess four main types
of social calls: H, L, R and A. The first three calls are associated with
distinct contextual valences for the caller (very positive social context,
neutral to mildly positive context, and socio-negative or mildly dangerous
context, respectively). The last call (A) is given in a broad range of
contexts and strongly signals the caller’s identity. Each of these calls
can be given alone or in combination according to the following pattern: a
contextual unit (i.e. H, L or R) merged with an arched unit (i.e. A)
(Figure 3).


Figure 3: Combined calls of female Diana monkeys (frequency axis in kHz).
(a) HA call (socio-positive contexts), (b) LA call (neutral to positive
contexts) and (c) RA call (negative contexts and mild danger).


To verify the combinatorial structure (i.e. whether apparently combined
calls consist of the merging of individual calls) and to test the
relevance of distinct combined calls to receivers, Coye et al. conducted a
playback experiment on females from a wild, habituated group of Diana
monkeys. In particular, to determine the relevance of the contextual unit
to receivers, they compared subjects’ reactions to stimuli created by
merging L or R units (i.e. relating to distinct contexts) with an A call
from a group member (i.e. LA_Group and RA_Group stimuli). To determine
whether A calls allowed receivers to identify the caller, they compared
subjects’ reactions to stimuli created by merging an R unit with either A
calls from group members or A calls from females in a neighbouring group
(i.e. RA_Group and RA_Neighbour stimuli). The change of one unit
systematically triggered predictable changes in receivers’ reactions. The
results strongly suggest that the contact call system of female Diana
monkeys relies on a combinatorial operation through which two independent
calls are combined into a larger structure whose information content
reflects its components. Hence, the contact call system of female Diana
monkeys matches the three criteria we proposed and may offer a parallel
with morphosyntax.


Campbell’s monkeys


Like Diana monkeys, adult Campbell’s monkeys possess a sex-specific vocal
repertoire: females’ communication relies mostly on social calls, while
males give mainly alarm calls. A notable example from female Campbell’s
monkeys involves the merging of a low-pitched trill (resembling the Diana
monkey’s L call), which can also be used alone and varies with the
caller’s emotional state, with an arched unit that strongly signals the
caller’s identity and social affiliation (i.e. resembling the Diana
monkey’s A call). However, in contrast to Diana monkeys, the second unit
combined by Campbell’s monkeys (i.e. the identity-rich arch) is never used
alone, suggesting a mechanism more akin to suffixation. Playbacks
verifying the combinatorial structure of these complex calls remain to be
performed.

Intriguingly, male Campbell’s monkeys also use a combinatorial system
resembling suffixation in their alarm calls (Figure 4, Table 2). More
precisely, males possess two urgent alarm calls, Krak and Hok. While the
first generally signals the presence of an urgent ground danger (i.e.
classically a leopard in the Taï National Park), the latter signals urgent
aerial dangers (i.e. classically an eagle). These calls can also be
combined with a unique ‘oo’ unit to create Krak-oo and Hok-oo calls
(Figure 4). While the ‘oo’ unit is never used alone, its addition to Krak
or Hok calls seems to reduce the danger signalled, given that Krak-oo and
Hok-oo calls signal, respectively, a general disturbance (e.g. a duiker)
and an aerial danger of lesser urgency (e.g. a fight in an associated
group of red colobus).


Figure 4: Sonograms of the calls of adult male Campbell’s monkeys. (a) Krak
(urgent ground danger), (b) Krak-oo (non-urgent general disturbance),
(c) Hok (urgent aerial danger) and (d) Hok-oo (non-urgent aerial danger).
On sonograms (b) and (d), the black arrow signals the position of the ‘oo’
unit. Figure reproduced from Coye et al., 2015.


Natural observations were complemented by a playback experiment aiming to
verify the combinatorial nature of Krak/Krak-oo calls in this alarm
system. The authors analysed the reactions of Diana monkeys (which react
to the distinct alarm calls of Campbell’s monkeys with their own
referential alarm calls) to the broadcast of natural and artificially
recombined Krak and Krak-oo calls (created by deleting the ‘oo’ part of a
Krak-oo call or by adding an ‘oo’ part to a Krak call, respectively).
Subjects’ reactions to Krak and Krak-oo calls reflected their distinct
levels of urgency, regardless of their origin (i.e. natural or
artificially created). Statistical analysis suggested that, although
subtle changes in the acoustic structure of the Krak part were perceived
by receivers (possibly reflecting the caller’s emotional state at the time
of calling), the presence or absence of the suffix was the main factor
driving subjects’ reactions. Thus, in addition to confirming the
combinatorial nature of Krak-oo calls (i.e. they result from the linear
merging of a Krak call with an ‘oo’ unit), this experiment demonstrated
that changes in call structure triggered predictable changes in receivers’
reactions and confirmed the biological relevance of adding an ‘oo’ unit to
decrease the urgency of Krak calls. Here again, we can thus conclude that
the three criteria proposed to define rudimentary parallels with
morphosyntax in animals are met.

The long-term study of Campbell’s monkeys’ communication also revealed
another type of combination in the alarm calls of males, which are given
in long sequences whose organisation appears to vary with the context.
Although this system does not possess the complexity of the syntactic
structures occurring in language, the types of calls involved as well as
the position of some call types in the sequence seem to obey non-random
rules and may well be meaningful to receivers. Krak-oo calls, which signal
general alerts, are found in most (but not all) alarm sequences given by
males. In addition, several regularities have been described: not only
sequence composition but also the order of calls and their rhythm of
emission vary with the context. Distinct call types, such as Krak and Hok
calls (which appear at the beginning of a sequence), can be added to
Krak-oo sequences depending on the type of danger, in particular the type
of predator (leopard or eagle) encountered. In addition, the urgency of
the situation (e.g. visual vs auditory detection of the predator)
influences the speed of delivery of Krak-oo calls in the sequence, while
the speed of delivery of Hok calls (when an eagle is detected) relates to
a male’s willingness to attack the predator. Boom calls (i.e. another call
type) are always given in pairs and trigger group gathering and movement
when produced on their own. However, when Booms are followed by other
calls, they signal non-predatory events, and the calls following them vary
with the context. For example, a Krak-oo sequence follows Booms when a
large branch or tree is falling down. The insertion of Hok-oo calls,
systematically between the Boom and Krak-oo series, into these
“tree-falling” sequences (i.e. Booms-Krak-oos) occurs during inter-group
encounters with neighbours (i.e. Booms-Hok-oos-Krak-oos). A first playback
study has investigated the ‘non-predatory’ modification of a sequence’s
message through the addition of Boom calls by comparing receivers’
reactions to natural predator-deterring sequences preceded or not by Boom
calls. Several further studies will now be required to experimentally
verify the other interesting patterns of organisation derived from
observational data.


Table 2: Summary of the main characteristics of the combined calls in the
species under the focus of this chapter.

Species | Structure of combined vocalizations | Meaningful elements? | Link between the meaning of units & combined structures | Mechanism for meaning differentiation | References
--------|-------------------------------------|----------------------|---------------------------------------------------------|---------------------------------------|-----------
Chestnut-crowned babbler (Pomatostomus ruficeps) | Calls composed of several notes | No | No | Rudimentary form of ‘phonemic contrast’ | [94, 155, 157]
Southern pied babbler (Turdoides bicolor) | Combination of individual calls into call sequence | Yes | Yes | Rudimentary form of ‘morphosyntax’: combination of single calls | [129, 153-154]
Diana monkey (Cercopithecus diana) | Combination of individual calls into larger calls | Yes | Yes | Rudimentary form of morphosyntax: combination of single calls | [29, 93, 113, 115]
Campbell’s monkey (Cercopithecus campbelli) | Combination of alarm calls with a call unit never used alone. Combination of calls in sequences | Yes | Yes | Rudimentary form of morphosyntax: combination of calls with a ‘proto-suffix’ | [40, 92, 118]

The experimental results presented in this section demonstrate that some
animal species combine meaningful structures in non-random ways to create
richer signals (i.e. conveying more complex information) or to diversify
the messages conveyed with only a limited number of distinct calls
(Table 2). In each example described in this section, the three criteria
we proposed to classify calls as phonological or morphosyntactic
structures were met. Additional testing will clearly be necessary to
further our understanding of the mechanisms underlying such
combinatoriality and the resulting changes in meaning. Notably, it will be
necessary to replicate the recombination experiments on other calls of
Campbell’s monkeys (i.e. males’ Hok/Hok-oo calls) and Diana monkeys (e.g.
HA calls) to determine the pervasiveness of sound combinations in these
species. Experiments manipulating the order of units in combined calls and
call sequences will also be required to fully determine how such changes
alter the information extracted by receivers.


4. Formal linguistic analysis of combinatorial systems


Whilst ethologists have relied on linguistics as a source of inspiration
for years, linguists have more recently also begun to systematically
compare and contrast animal and human communication systems by applying
methods from formal linguistics (i.e. positing rules to formally define a
‘lexicon’, a ‘syntax’ and a ‘semantics’ for a given system). Among other
primates, the vocal systems of Campbell’s and Diana monkeys have been
subjected to such analyses in studies by Schlenker and colleagues. The
authors reanalysed existing data on these guenons, providing
investigations complementary to the ethological approach.

A first study on Campbell’s monkeys focussed on the possible semantic
content of Krak, Hok and their ‘suffixed’ versions. It compared models
built using methodologies from formal semantics to shed light on the
possible meanings of these calls and on the mechanism by which the
addition of an ‘oo’ unit alters the meaning of the call ‘stems’ (i.e. Krak
and Hok). The authors specifically focused on the distinct calling
patterns of males from two populations of Campbell’s monkeys, in Ivory
Coast (Taï National Park) and Sierra Leone (Tiwai Island). Crowned-hawk
eagles are present in both areas, whereas leopards are still present in
Taï but have been absent from Tiwai for as long as thirty years.
Importantly, while Hok functions to signal the presence of an eagle in
both populations, Krak is used primarily to signal the presence of a
leopard in Taï but has the distribution of a general alarm call on Tiwai
(i.e. it is given to a broad range of disturbances, including falling
trees and eagles). To determine which ‘semantic’ explanation best captured
the patterns observed, the authors systematically tested the predictions
of two models against the data. The first model posits that, in both
populations, Krak and Hok calls have the same “innate” meanings (i.e.
Krak: general disturbance; Hok: aerial predator) and that the addition of
an ‘oo’ unit decreases the urgency of the innate meaning of both calls
(i.e. Krak-oo: general and less urgent disturbance; Hok-oo: less urgent
aerial disturbance). Finally, this model hypothesises that, while Krak-oo
is derived from the innate meaning of Krak in both areas (i.e. the Krak
‘root’ of Krak-oos has kept its original meaning), the ‘lexical entry’ for
Krak in Taï has changed to ‘leopard-related disturbance’.
The second model proposes an alternative hypothesis to explain the pat- 
tern described: the innate meaning of Krak and Hok calls is the same in 
both populations (i.e. Krak: general disturbance; Hok: aerial predator) 
and it holds in both unsuffixed and suffixed calls. But while both Krak-oo 
(i.e. non-urgent danger) and Hok (i.e. aerial predator) are specific, Krak 
has a rather broad meaning (i.e. general alarm call). The second model thus 
proposes that the competition between more specific calls and Krak calls 
may lead to the strengthening of the meaning of Krak in a mechanism akin 
to scalar implicatures (see Table 1 for a definition). Specifically, when
a male gives Krak calls, a receiver might infer that there is a non-weak and
non-aerial disturbance, as the call given is neither Krak-oo nor Hok. Hence, the
meaning of Krak calls can be strengthened from ‘general urgent disturbance’
into ‘dangerous non-aerial predator’. In Tai, the presence of leopards led
to the strengthening of the meaning of Krak calls to ‘dangerous non-aerial
predators’, whereas in Tiwai the absence of ground predators prevented
it. From this, the authors concluded that the second model was more
parsimonious and more likely to describe the associated ‘meanings’ of calls
in the call system of Campbell’s monkeys than the first one.
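
The inference described for the second model can be illustrated with a toy
computation. The sketch below is not the formal analysis of Schlenker and
colleagues. It assumes, purely for illustration, that a situation is a pair
(urgency, source), that the literal meaning of a call is the set of situations
it is compatible with, and that a call is strengthened by excluding every
situation covered by a strictly more specific competing call. The names
`LITERAL`, `strengthen` and `interpret`, and the way the two sites are encoded,
are hypothetical.

```python
from itertools import product

# Assumed toy ontology: each situation is an (urgency, source) pair.
ALL_SITUATIONS = frozenset(product(("weak", "serious"), ("aerial", "non_aerial")))

# Literal meanings under the second model, encoded as sets of situations.
# Hok-oo is left out because the inference in the text involves only these calls.
LITERAL = {
    "Krak": ALL_SITUATIONS,                                             # general alarm
    "Hok": frozenset(s for s in ALL_SITUATIONS if s[1] == "aerial"),    # aerial predator
    "Krak-oo": frozenset(s for s in ALL_SITUATIONS if s[0] == "weak"),  # non-urgent danger
}


def strengthen(call):
    """Remove the situations covered by any strictly more specific call."""
    literal = LITERAL[call]
    strengthened = set(literal)
    for competitor_meaning in LITERAL.values():
        if competitor_meaning < literal:  # proper subset: a more specific competitor
            strengthened -= competitor_meaning
    return frozenset(strengthened)


def interpret(call, habitat):
    """Use the strengthened meaning only if it can apply in this habitat (an assumption)."""
    strengthened = strengthen(call)
    return strengthened if strengthened & habitat else LITERAL[call]


# Hypothetical encoding of the two sites: serious non-aerial threats (leopards)
# occur in Tai but not on Tiwai.
TAI = ALL_SITUATIONS
TIWAI = ALL_SITUATIONS - {("serious", "non_aerial")}

print(sorted(interpret("Krak", TAI)))
print(sorted(interpret("Krak", TIWAI)))
```

Under these assumptions, Krak is narrowed to serious non-aerial threats in Tai
and keeps its general meaning on Tiwai, while Hok and Krak-oo are unchanged
because no more specific competitor exists for them in this toy lexicon.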

A second study led by Schlenker and collaborators proposed to ana- 
lyse the communication of female Diana monkeys, using both a statistical 
analysis of transition probabilities between units and call types and a
formal semantic analysis of utterances based on their context of emission.
Again, the authors proposed two competing hypotheses to describe
the system observed. The first hypothesis proposed that combined
calls consisted of two simple calls given in close succession (i.e. maximized
adjacency hypothesis). By contrast, the second hypothesis proposed that
combined calls (i.e. HA, LA and RA calls) resulted from the combination 
of two units that were subsequently used as one call (i.e. combined calls 
hypothesis). To determine which hypothesis was the more likely, the
authors developed two corresponding models (e.g. putative ‘rules’ of call use
describing the observed patterns) and compared them. This work showed
that treating ‘combined’ calls as sequences of simple calls given in close
succession failed to account for their distribution in sequences. The most
parsimonious model was obtained under the ‘combined calls hypothesis’
(i.e. combined calls result from the combination of vocal units and are
used as one call), as the alternative hypothesis (i.e. maximized adjacency
hypothesis) would need to be supplemented by phonological complexity
in order to account for the data with respect to maximal sequence length
and call repetition.

Other recent articles by the same authors offer analyses of the calling
systems of additional species using similar methods (e.g. black-fronted titi
monkeys and putty-nosed monkeys). The results obtained converged
with field observations, and these articles are key not only in generating
testable hypotheses but also in confirming the relevance of using linguistic
methodologies to analyse combinatorial systems in non-human animals.
We argue that these studies bring key additional support to our findings while
adopting different, yet complementary, methodological approaches. Indeed,
although previous studies had also described non-random patterns of
transitions between the elements comprising vocal sequences produced by
animals (e.g. marine mammals, bats and birds), they failed to take into
account the meaning and relevance to receivers of sequence organisation and
composition. For instance, Kershenbaum and collaborators analysed the
vocal sequences produced by animals from several taxa (i.e. killer and pilot
whales, rock hyraxes, Bengalese finches, Carolina chickadees, free-tailed bats
and orangutans) using various transition models of increasing complexity
to determine which one best matched the transitions between elements in
the sequences recorded. Such studies are very informative regarding the
possible evolution of sequence complexity in animals and may help to
bridge the gap between human language and animal communication.
However, in language, combination is relevant only because it is
meaningful. The work reviewed in this chapter highlights the need for a
more systematic analysis of the structure, meaning and composition of
animal sequences and of their relevance to receivers, in particular when one
aims to undertake a comparative approach with human language. We believe that
the current progress on animal combinatorial abilities, together with future 
developments in complementary methodological approaches and appropriate
terminology, will pave the way to a more comprehensive understanding
of the evolution of sound combination in animals. In the final section we 
will discuss the evolutionary insights such comparative data can provide 
on the drivers of the emergence of combinatorial abilities. 
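
To make the notion of a transition model concrete, the following minimal sketch
estimates first-order transition probabilities from a set of call sequences.
It is only one simple instance of the family of models discussed above and is
not the code used in any of the cited studies. The function name
`transition_probabilities` and the example sequences (with placeholder unit
labels A, H and L) are invented for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next unit | current unit) from observed sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        # Pair each unit with the unit that immediately follows it.
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        unit: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for unit, followers in counts.items()
    }


# Hypothetical recordings: each inner list is one vocal sequence.
sequences = [
    ["A", "A", "H", "A"],
    ["L", "A", "A"],
    ["H", "A", "L", "A"],
]

for unit, row in sorted(transition_probabilities(sequences).items()):
    print(unit, row)
```

A table of this kind captures how units follow one another but, as argued
above, it says nothing about what the sequences mean to receivers; that
information has to come from calling contexts and playback experiments.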


5. Evolutionarily relevant insights from animal combinatorial
systems?


The topic of language origins is frequently accompanied by heated debates
over the analogous (i.e. resulting from convergent evolution) or homologous
(i.e. inherited from a common ancestor) nature of some parallels with
language described in non-human primates, including combinatorial abilities
(e.g. [143-148]). However, we suggest that this is not the most pressing
question, because we can learn a lot from the study of animal communication
regardless of its shared or distinct evolutionary history with language.
Indeed, although language is a unique communication system, it is also clearly
the product of a gradual evolutionary process and, in this regard, it does not
differ from other animal communication systems. Thus, in our opinion, a
more important question to tackle is: what pressures drove the
evolution of combinatorial abilities?


Social complexity 


Social life is often viewed as a major driver of communicative complexity,
and this hypothesis has been supported by empirical studies highlighting
a positive relationship between indexes of social complexity and signal
diversity for both social and alarm calls. The description and testing
of combinatorial systems in animals suggest that sound combination may
allow the diversification of a species’ repertoire using a limited number of
signals. Interestingly, two studies comparing closely related species of
non-human primates and herpestids reported a correlation between the
complexity of a species’ social life and the presence, diversity and frequency of
use of combinatorial structures. In line with this, each species in which
meaningful sound combinations were described has also been reported to
reside in complex and strongly bonded social groups. These observations
support the idea that increased needs for complex communication
resulting from social complexity might have played an important role in
the emergence of combinatorial capacities in animals.


Phonatory limits 


Another possible factor leading to the emergence of combinatorial ca- 
pacities results from the phonatory limits that some species face, notably 
in non-human primates. Work from computational modelling provides 
relevant additional insight here. Nowak and colleagues modelled
scenarios for the emergence and propagation of certain language features in a
population, such as arbitrary signals, sound combinations and grammatical
rules. Nowak et al. proposed that combinatorics would emerge after a
communication system reaches a threshold number of signals above which 
the addition of new signals (because they would be likely to resemble ex- 
isting ones) ultimately increases the error risk due to mis-comprehension. 
In this case, the combination of sounds would allow a continued increase 
in a language’s fitness (through addition of new signals) without increas- 
ing the risk of ambiguous information transfer. This rationale relies on 
the hypothesis that a species is capable of increasing its repertoire via the 
acoustic diversification of signals in the first place. We propose that it is 
also valid in species with limited capacities of vocal production but that 
in this case, the first limit to signal diversification might be the species’ 
lack of vocal plasticity rather than the breadth of the existing repertoire. 
This hypothesis is supported by the fact that all the species in which
sound combination has been shown to play a meaning-differentiating
role display limited capacities for vocal production. Further
studies investigating the presence of meaningful call combinations in spe- 
cies characterised by various levels of social complexity as well as distinct 
capacities for vocal learning are key to testing these hypotheses with more 
extensive empirical data. 
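
The error-limit argument can be illustrated numerically. The sketch below is a
deliberately simplified toy and does not reproduce the equations of Nowak and
colleagues. It assumes that signals are evenly spaced along a single acoustic
dimension of unit length, that perception adds Gaussian noise, and that every
signal has a neighbour on each side (which slightly underestimates accuracy
for the two end signals). All function names and parameter values are
hypothetical.

```python
import math


def identification_accuracy(n_signals, noise_sd):
    """Chance that a perceived signal falls closer to its own target than to a neighbour."""
    if n_signals < 2:
        return 1.0
    half_gap = 0.5 / (n_signals - 1)  # half the distance between adjacent signals
    # P(|noise| < half_gap) for zero-mean Gaussian noise with standard deviation noise_sd.
    return math.erf(half_gap / (noise_sd * math.sqrt(2)))


def holistic_repertoire(n_messages, noise_sd):
    """One acoustically distinct signal per message: (messages, accuracy)."""
    return n_messages, identification_accuracy(n_messages, noise_sd)


def combinatorial_repertoire(n_sounds, sequence_length, noise_sd):
    """Messages built as fixed-length sequences of a few sounds: (messages, accuracy)."""
    per_sound = identification_accuracy(n_sounds, noise_sd)
    # A sequence is understood only if every sound in it is identified correctly.
    return n_sounds ** sequence_length, per_sound ** sequence_length


NOISE_SD = 0.05  # assumed perceptual noise, in arbitrary units

print(holistic_repertoire(16, NOISE_SD))
print(combinatorial_repertoire(4, 2, NOISE_SD))
```

With these assumed values, sixteen holistic signals are identified correctly
only about half of the time, whereas sixteen messages built from pairs of four
well-separated sounds are almost always understood. The same reasoning applies
if the ceiling on distinct signals comes from limited vocal plasticity rather
than from crowding of an already large repertoire.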


Habitat and constraints on communication 


Finally, habitat has often been proposed as a factor influencing species’
communication. In particular, it has been proposed that dense habitats, which
impose constraints on sound propagation and visual access to others, may
favour the emergence of discrete communication signals (i.e. as opposed
to communication systems of graded signals, whose acoustic structures
form a continuum without distinct boundaries between call types) [161-164].
In dense habitats, discrete signals would allow more robust communication
and prevent ambiguities resulting from poor visual access to others.
Sound combination may therefore benefit animals through the production 
of more efficient communication signals. This is, for instance, the case in 
female Diana monkeys, whose combined calls linearly convey information 
about a caller’s emotional state and identity. Here, females concatenate 
signals sequentially which might have already evolved to ensure maximal 
efficiency of information transfer (e.g. calls with more salient identity cues 
or with an improved acoustic adaptation to propagation constraints). This 
organisation allows information to be temporally segregated creating richer 
signals without increasing ambiguity due to the accumulation of
information. Interestingly, a combinatorial system resembling that of Diana
monkeys has been described in the graded contact calls of desert-living
banded mongooses. More precisely, banded mongooses use a contact call
composed of two segments, given in three distinct contexts: when the caller
is digging, searching or moving. The first segment relates strongly to
a caller’s identity and remains identical in the three contexts. A playback
experiment confirmed that between-caller variations in the identity segment
were relevant to receivers. The second segment has a graded structure
and varies with the caller’s activity: when the caller is digging, the segment is
absent (or very short); its duration increases when the caller is searching
and reaches its maximum (together with more pronounced harmonics)
when the caller is moving. Thus, here again, the use of combinatoriality
seems to increase the information content of calls while maintaining a low 
level of ambiguity. 

Importantly, banded mongooses live in an open habitat but lack visual
access to conspecifics because their foraging strategy constrains them to face
the ground most of the time. Although the literature has traditionally pitted
species with graded and discrete repertoires against each other, multiple
concerns about the relevance of this dichotomy have been raised, notably
because of evidence for subtle gradation in the communication of ‘discrete’
species as well as evidence for categorical perception of graded signals by
receivers. Taken together, these observations suggest that, more
than habitat-based propagation constraints, the lack of visual access to
others may be important in influencing the emergence of short-distance
combined social call structures that convey complementary information
about a caller’s identity, activity and location. Finally, this hypothesis
can be aligned with the theoretical work of Nowak and collaborators.
Indeed, if combination has emerged to limit the risk of ambiguous com- 
munication, the inability to disambiguate the context of calling or caller’s 
identity using visual cues (e.g. due to habitat constraints or foraging strat- 
egy) is a possible additional factor triggering its emergence. To investigate 
the potential relative impacts of habitat density and actual visual access to 
others on the development of combinatorial capacities, we would need to 
extend the comparison to other species whose visual access to others con- 
tradicts the predictions that could be proposed by simply looking at their 
habitat density (e.g. species other than banded mongooses that have poor
visual access in spite of an open habitat).

The hypotheses proposed above shed further light on the factors involved 
in the evolution of language and other communication systems and have 
largely resulted from data provided by only a few species where receivers 
have been experimentally documented to process and use combinatorial 
structures. Various additional examples also exist that have not yet been 
completely described and are likely to fit the definitions we have previously
used (i.e. in section 1). In particular, several systems in which apparently
meaningful calls are combined into larger structures whose context
of emission reflects that of their parts have been described in wedge-capped
capuchins, cotton-top tamarins, female Campbell’s monkeys,
red-bellied titi monkeys, black-fronted titi monkeys, red-capped
mangabeys, bonobos and chimpanzees, as well as in non-primate species
such as Japanese great tits, banded mongooses and meerkats.
The diversity of species in which meaningful sound combinations have been
documented (e.g. birds and mammals, including primates and herpestids)
suggests that combination may be an evolutionary solution to
communicative demands, and the comparative study of these species is central
to testing any hypotheses regarding the potential drivers promoting the
emergence of combinatoriality.


6. Conclusion: Towards a more comprehensive approach to
combinatorial abilities


The study of animal combinatorial abilities appears to be a promising re- 
search area, with a number of avenues open to exploration. In this chapter, 
we restrict discussion to examples exhibiting more or less marked parallels 
with language. However, a large number of other combinatorial systems in 
animals remain to be investigated, among which some have been described 
but remain only partially understood (e.g. gorillas, putty-nosed monkeys,
gibbons, rock hyraxes, mustached bats). In addition, some
animal combinatorial systems may differ strikingly from language in their
organisation and in the underlying mechanisms facilitating information transfer.
For instance, in some systems the diversity of units (e.g. in some
songbirds) or the proportion of various units (e.g. in bonobos) seems
to play a role in meaning generation.

Joint efforts from linguists, psychologists and ethologists are clearly
necessary to provide a unified and more relevant framework. One possible
first step would be to develop a terminology suitable to describe the vast
diversity of combinatorial systems found throughout the animal kingdom.
Indeed, whilst some rare examples developed in this chapter can be captured
by pre-set definitions, it seems clear that a number of sound combination 
systems will not. However, even (if not especially) in those cases, the use of 
strict definitions is essential. This is important firstly to provide a clearer
view of the diversity and complexity of combinatorial organisations in
the animal kingdom. Secondly, and perhaps more pertinently, the
study of varied examples relating to potentially meaningful, non-random
and contextually flexible combination patterns may be an important step
towards understanding the biological relevance of vocal combination in
animals and its evolution(s).

The rationale adopted to build the two definitions we propose for par- 
allels with phonology and morphosyntax (section 1) could be generalised 
to develop a more suitable terminology for alternative systems of sound 
combination. Notably, the definitions we proposed involve three compo- 
nents: (1) whether the vocal units combined possess an intrinsic meaning, 
(2) whether (and how) the meaning of their combination reflects the
meaning of individual elements (if they possess one) and (3) which rules (if any)
best describe the mechanism for meaning-differentiation between combined
utterances with distinct functions. In addition, the third component of the 
definitions would also allow us to capture systems that differ strongly from 
language, such as those relying on the proportion of one call type, or on the 
diversity of units involved. As such, this three-component structure could be 
further expanded to characterise the variety of animal combination systems 
described while offering a systematic basis for interspecific comparison. 

In addition, it may be useful for authors specialised in various taxa
(e.g. ornithologists, primatologists, marine biologists) and disciplines
(e.g. ethologists, philosophers, linguists) to readdress questions pertaining to
the nature of meaning, as this central question may be approached from
various directions (e.g. can regional ‘dialectal’ variations of a song also be 
considered as changes in meaning?). In any case, future studies focussing on 
diverse combinatorial systems, including systems that differ strongly from 
language, are likely to be fruitful. Indeed, understanding the organisation 
and evolution of systems that differ strongly from ours will, if anything, 
bring insights into the various evolutionary paths that the human lineage 
did not follow and may be a relevant source of information to identify 
important turning points in our “history”. 

This chapter focussed mostly on studies that relied on the simultaneous 
use of natural observations of calling contexts and experiments. These two 
complementary approaches are essential to investigating the combinatorial 
structure (i.e. transferability of units in combination) of complex utterances 
and the relevance to receivers of changes in meaning as a result of changes 
in the combinatorial structure. We argue that this is a key first question to 
tackle in order to provide a comprehensive description of animal
communication systems. However, beyond the combination of phonemes into words
and words into sentences, language relies on a set of rules that allow
interlocutors to produce and understand completely novel utterances.
The cognitive mechanisms underlying our processing of rules and our
capacity to generalise them are at the very base of language generativity.
Now that some animal communication systems have been characterised 
in terms of their basic structure and meaning-differentiating mechanisms, 
the next step should be dedicated to clarifying the cognitive mechanisms 
underlying their combinatoriality. In particular, it will be important to 
determine whether animals perceive combined utterances as a mosaic of
elements whose message can be inferred from the elements and their
relationships (i.e. as compound signals) or as unique elements whose
combinatorial nature is only structural. For instance, do Campbell’s monkeys
learn the meaning of Krak, Krak-oo, Hok and Hok-oo calls independently,
or do they learn the meaning of Krak, Hok and the alteration of meaning
associated with the presence of an oo unit? And would they be capable of
generalising that “rule”? Previous studies have shown that some non-human
primates possess particularly sophisticated social cognition skills involving 
a hierarchically structured representational knowledge of social
relationships, governed by rules and involving causal inference, a likely result
of their complex social life. In addition, studies based on experimental
tasks suggest that some animals possess, to some extent, capacities
to parse combinatorial and sequential artificial structures or rules. For
instance, non-human primates of several species have been shown to learn
sequential lists of items and to compute probabilities of occurrence
and dependencies between syllables or letters [186-189]. Cotton-top tamarins
have the capacity to acquire general ‘rules’ of structuring such as simple
pFSA grammar, and starlings, as well as language-trained animals (apes,
dolphins and parrots), even acquired more complex rules such as recursive
structures or generative languages [192-194]. In line with this, a particularly
interesting perspective would be to compare the experimental performances
in laboratory settings (e.g. assessing a species’ capacity to acquire and
handle various artificial grammars) of species using combinatorial structures
to various extents and determine whether their performance at parsing
artificially constructed structures correlates with their natural tendency to
use sound combination during communication. Finally, studies clarifying
the ontogeny and acquisition mechanisms of vocal repertoires involving 
combinatorial structures would be important to complete our knowledge, 
especially in species that seem to rely on meaningless notes, as the arbi- 
trariness of combinations should be based on a mandatory learning phase. 

To conclude, there now exists a growing array of species known to rely on
sound combinations, which provides an intriguing
starting point for investigations into the evolution and emergence of these
abilities. When reviewing combinatorial structures across species, however,
it becomes clear that questions associated with the meaning and information
conveyed by animal signals are central and should be taken into account in
the development of an appropriate terminology to describe sound
combinations in animals. Finally, we actively encourage interdisciplinary research
uniting linguists, ethologists, psychologists and anthropologists to build
a unified framework and to further explore the links between human
language and animal communication.


Primate Roots of Speech and Language


Abstract: Human language is largely a vocal behaviour that has evolved from a 
more ancient primate communication system. Although vocalizations are also the 
main way by which nonhuman primates communicate and interact socially, it has
been difficult to demonstrate direct transitions from non-linguistic primate vocal 
communication to human language. Nevertheless, several continuities are apparent. 
First, primates produce and perceive sounds by specialized anatomical and neural 
structures also present in humans. Compared to humans, however, nonhuman pri- 
mates are severely limited in the amount of control they have over vocal production, 
which restricts their ability for phonology, syntax, and vocal learning. But language 
is also a cognitive capacity and here there is good evidence that primates understand 
others’ calls as given by specific individuals to specific events or social interactions. 
In great apes, moreover, callers can take the past history with their audience into 
account, by suppressing, exaggerating and socially directing their calls in strategic 
ways. Yet, there is no clear evidence that primates, apart from humans, perceive 
others as governed by complex mental states, such as shared knowledge or false 
beliefs, during acts of communication. Also, primates do not seem to be motivated 
to convey knowledge relevant to their audience and there is no clear indication that 
they use vocal behaviour for the purpose of social bonding. The current hypothesis 
is that these differences in cognition and motivation have prevented the evolution 
of flexible, combinatorial vocal communication in nonhuman primates. 


Keywords: nonhuman primate communication, intentionality in communication, 
referential communication 


1. Introduction 


Considerable debate surrounds the question of how and why human 
language has evolved from non-linguistic forms of primate communica- 
tion. From an evolutionary perspective, it is implausible that language has 
emerged without any relevant precursors, so the debate is largely on the 
nature of continuities and discontinuities between nonhuman primate and
human communication (Fitch and Zuberbühler, 2013). Most likely, language
is a mosaic of older components that have emerged during different
evolutionary time periods, some of which perhaps only recently during 
hominine evolution. The goal of this chapter is to identify key components 
of the language faculty, particularly vocal production and the cognitive 
mechanisms underlying language, and to explore how they might have 
evolved from earlier forms. This is done within a comparative approach,
with a special focus on great ape natural communication. Throughout the
chapter, a distinction is made between communication, speech and lan- 
guage. Communication is defined as the exchange of thoughts, messages, 
or information, as by speech, signals, writing or behaviour (n.d. 2011b). 
Speech is defined as the act of expressing or describing thoughts, feelings,
or perceptions by articulation of words (n.d. 2011a) and language as com- 
munication of thoughts and feelings through a system of arbitrary signals, 
such as voice sounds, gestures or written symbols (n.d. 2011c). 


2. Vocal production¹


The basic mechanism for sound production in mammals is described by the 
source-filter theory. Both animal vocalisations and human speech sounds 
are produced by an apparatus that consists of two independent mechanisms,
the source (larynx) and the filter (supra-laryngeal vocal tract; Fant,
1960). Compared to other primates and most other mammals, however,
human vocal communication is highly unusual. Humans not only possess 
a species-specific repertoire of non-linguistic vocalisations, but they also 
have the capacity to actively control the apparatus in order to produce 
sustained airflow, which then generates stable vibration of the larynx, the 
fundamental frequency and acoustic source of speech (Herbst, 2016). In 
addition, humans have fine motor control of various anatomical structures
involved in speech production, so-called articulators, the result of which
are several hundred perceptually distinct phonemes, the building blocks
of the 7,000 or so currently spoken languages. Each language only uses a
small fraction of this species-specific repertoire and speakers lose the
ability to discriminate the different sound contrasts during ontogeny (Crystal
1997). The variability across languages is enormous. Some languages use
nine vowels and ten consonants (Andoke; Colombia) while others only
three vowels and 22 consonants (Diyari; Australia; Lameira et al., 2014).
Vowels and consonants are the product of complex vocal tract
configurations by the tongue, lips and jaw, while the sound producing activity of
the larynx remains relatively constant. Some consonants are produced as
co-articulations with vowels, while others are produced without vocal-fold
vibration (e.g. voiceless stops /p/, /t/, /k/). Recent work has shown that
the human vocal apparatus is not fundamentally different from those of
non-human primates, suggesting that there are no anatomical reasons that
would prevent other species from producing speech. The vocal production
apparatus of higher primates, in other words, is speech-ready (Boé et al.,
2017; Fitch et al., 2016).


1 Based on previously published material by A. R. Lameira, I. Maddieson, and
K. Zuberbühler, ‘Primate Feedstock for the Evolution of Consonants’, Trends
in Cognitive Sciences, 18/2 (Feb 2014a), 60-62, and W. T. Fitch and K.
Zuberbühler, ‘Primate Precursors to Human Language: Beyond Discontinuity’, in
Eckart Altenmüller, Sabine Schmidt, and Elke Zimmermann (Eds.), Evolution of
Emotional Communication: From Sounds in Nonhuman Mammals to Speech
and Music in Man (Oxford: Oxford University Press, 2013).
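
The independence of source and filter described at the start of this section
can be illustrated with a small numerical sketch. It is a toy model under
strong assumptions and not a speech synthesiser: the source is reduced to
harmonics of the fundamental frequency whose amplitude falls off as 1/k, and
each vocal tract resonance is modelled as a simple second-order resonator. The
formant values (700 and 1200 Hz, or 300 and 1200 Hz), the quality factor and
all function names are illustrative assumptions.

```python
import math


def resonance_gain(freq, formant, q=10.0):
    """Magnitude response of a second-order resonator centred on `formant` (Hz)."""
    r = freq / formant
    return 1.0 / math.sqrt((1 - r * r) ** 2 + (r / q) ** 2)


def spectrum(f0, formants, n_harmonics=30):
    """Harmonic amplitudes of the output: source amplitude times filter gain."""
    amplitudes = {}
    for k in range(1, n_harmonics + 1):
        freq = k * f0                      # the source fixes where the harmonics lie
        source_amplitude = 1.0 / k         # assumed spectral tilt of the source
        filter_gain = math.prod(resonance_gain(freq, f) for f in formants)
        amplitudes[freq] = source_amplitude * filter_gain  # the filter shapes them
    return amplitudes


def strongest_harmonic(amplitudes):
    """Frequency (Hz) of the harmonic with the largest amplitude."""
    return max(amplitudes, key=amplitudes.get)


# Same filter, two different sources: the spectral peak stays at the first resonance.
print(strongest_harmonic(spectrum(100, (700, 1200))))
print(strongest_harmonic(spectrum(140, (700, 1200))))
# Same source, different filter: the peak moves with the resonance.
print(strongest_harmonic(spectrum(100, (300, 1200))))
```

In this sketch, changing the fundamental frequency moves the harmonics but
leaves the spectral peak at the first resonance, whereas changing the
resonances moves the peak without altering the harmonic frequencies, which is
the sense in which the two mechanisms are independent.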

If the main evolutionary transition towards speech has been at the level 
of larynx control and not, as claimed for decades, due to differences in vocal 
tract anatomy (Fitch and Zuberbühler, 2013; Lieberman, 2012), then the 
main question is how and why humans have evolved such an unprecedented 
level of vocal control. One line of research addresses this kind of human 
uniqueness at the neural level, insofar as the muscles that operate the larynx 
appear to be governed by projections from the motor cortex to the
brainstem nuclei that steer laryngeal muscles (Jürgens, 2002). One hypothesis
is that this direct cortical pathway may be the neural cause of the fine mo- 
tor control that humans have over their laryngeal musculature to produce 
the fundamental acoustic source for speech and for singing. Whether or 
not the duet songs in primates, such as gibbons or indris, are governed by 
human-like laryngeal control is currently unknown (Filippi, 2016; Gamba 
et al., 2016; Geissmann, 2002). In humans, laryngeal control is responsible 
for the rhythmic and intonational aspect of language, i.e. speech prosody 
(Hirano et al., 1969; Ohala, 1990). Yet, the physiological mechanisms 
underlying prosodic features are still poorly understood (Erickson, 1995; 
Finnegan et al., 2000; Lecuit and Demolin, 1998). Some progress has been
made by modelling the relationships between fundamental frequency and
subglottal pressure and with other techniques, such as laryngeal
electromyography (e.g. Riede et al., 2011; Riede, 2011; Zhang and Ghazanfar,
2016). Further research in this field is likely to contribute to explaining
the human uniqueness of speech.


3. Vocal learning 


Within the primates, humans are the only species that can vocally imitate, a 
behaviour that starts early in ontogeny. One hypothesis for why non-human 
primates are prevented from vocal learning is in terms of differences in la- 
ryngeal control. If an individual cannot control sound production at will, 
then it cannot vocally imitate arbitrary sound patterns, a basic requirement
for speech acquisition. Non-human primate communication, instead, is 
limited to biologically hardwired vocal repertoires, a collection of signals 
given in context-related and often age- and sex-specific ways (Zuberbühler, 
2016b). Typically, different call types have different biological functions, 
such as ‘greeting’ calls to facilitate social interactions or ‘alarm calls’ to 
avoid predation. Other examples are food calls, copulation calls, movement 
calls, long-distance calls, or lost calls. For some call types, phylogenetic 
relatedness and acoustic similarities are linked, with more closely related 
species producing more similar calls, a finding that includes human vocali- 
sations (Davila Ross et al., 2009; Kersken et al., 2017). 

Some call structures appear to have evolved to be psychologically ef- 
fective on receivers, either by having physiological effects or by triggering 
relevant psychological processes in recipients, such as increased attention 
or facilitated learning (Owren and Rendall, 2001). However, all this is 
not to say that primate sound production is completely inflexible. Various 
studies on monkeys have shown that social variables can influence the 
acoustic structure of some call types. For instance, in Campbell’s mon- 
keys, the strength of a social bond between two individuals significantly 
correlates with the acoustic similarity of their contact calls, independent 
of genetic relatedness (Lemasson et al., 2011). Evidence of this type is 
relatively common, but is usually in terms of acoustic variations within an 
existing call type rather than the emergence of new vocal structures, which 
is very different from the ease with which human infants produce, combine
and learn complex sound utterances. Also relevant is research on captive
primates, which has shown that subjects can be trained to produce some 
calls on command and to modify the acoustic structure of their calls (Fitch 
and Zuberbühler, 2013). 

Given the seemingly unbridgeable gap in motor control between hu- 
man and nonhuman primate vocal behaviour, how did speech evolve? A 
relevant finding here is that motor control of the supra-laryngeal vocal 
tract may be less of a constraint, at least for great apes, compared to con- 
trol of the larynx. Great apes possess call repertoires that comprise both 
voiced and voiceless calls, with some voiceless calls being subject to social 
learning (Lameira et al., 2014). In wild orang-utans, for example, there are
population differences in calls produced during nest building with some 
populations producing ‘raspberries’ and others ‘smacks’, in the absence 
of genetic or habitat differences. Captive orang-utans can imitate human 
voiceless whistling sounds, a clear demonstration of their ability to control 
key speech articulators and airflow. In chimpanzees, an interesting case 
study is ‘Viki’, who underwent intensive language training by her human 
caretakers and so learned to shape the vocal tract to imitate a few English
words (‘mama’, ‘papa’, ‘cup’), although she was unable to activate the larynx as a
sound source (Hayes, 1951). 

A reasonable conclusion from this literature is that the common ances- 
tor of humans and great apes already possessed the ability to control their 
supra-laryngeal vocal tract, while vocal fold control emerged later and 
perhaps only after the phylogenetic split from the other great apes (de Boer 
et al., 2015). 


4. Concepts and categories²


Human language is not just a vocal behaviour but the product of a cogni- 
tive architecture that is in constant interaction with social partners. The 
main tools for these social interactions are mental representations of the 
world, organised in humans as concepts (defined as mental representations 


2 Based on material published by K. Zuberbühler, ‘Social Concepts and Com- 
munication in Nonhuman Primates’, in Mark A. Bee and Cory T. Miller (Eds.), 
Psychological Mechanisms in Animal Communication (New York: Springer, 
2016a). 
that refer to categories, i.e. sets of entities), with corresponding symbolic
representations in a speaker’s linguistic vocabulary. Nonhuman primates 
do not have linguistic lexicons but they may nevertheless have mental con- 
cepts akin to ours. So how do primates and other animals organise their 
worlds internally? What are natural concepts in primate minds and how 
are they linked to vocal communication? In the following, evidence for dif- 
ferent types of social concepts is reviewed, including dominance, kinship, 
friendship, and group membership.


4.1 Dominance 


A common challenge in social groups is that some individuals exert social 
power over others, which can lead to conflicts. Possessing an abstract notion 
of social rank is thus likely to be adaptive for navigating such systems, and
there is some evidence for this in nonhuman primates. In chacma baboons,
Bergman et al. (2003) showed that subjects responded strongly to vocal 
interactions between two group members, provided this suggested a rank 
reversal. The effect was even more pronounced if the rank change occurred 
between individuals that belonged to different matrilines, which suggested 
a larger upheaval within the group. The conclusion was that these primates 
comprehend the invisible hierarchical social structure that is characterised 
by dominance ranks. The study has recently been partly replicated in vervet 
monkeys, with similar results (Borgeaud et al., 2013), suggesting that pri- 
mates in general represent social rank in complex ways. 

But how do primates learn to understand their own and others’ social 
rank in a group? One possibility is a cognitive operation termed transitive 
inference by which subjects observe social interactions between others to 
compute an invisible dominance matrix (e.g. Tromp et al., 2015). Another 
mechanism is more proactive, by directly challenging other group members 
that are presumed to be close in social rank. Male bonobos, for example, 
actively provoke other group members by approaching them with acousti- 
cally distinct ‘contest hoots’ combined with aggressive gestures, for the sole 
purpose of provoking a social reaction, usually an agonistic chase. The social
targets are chosen very carefully, by selecting group members of adjacent 
rank, a behaviour that also seems to function as a way to advertise the 
caller’s own social position in the group to bystanders (Genty et al., 2014). 
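
The logic of transitive inference can be illustrated computationally. The following is a minimal sketch only, assuming that observed interactions can be reduced to (dominant, subordinate) pairs; the function name, the individual labels and the data are hypothetical, and the code is not a model of what primates actually compute nor the method of any study cited here.

```python
from itertools import product


def infer_dominance(observed):
    """Return every (dominant, subordinate) pair implied by transitivity.

    `observed` is a list of (dominant, subordinate) pairs (hypothetical data).
    """
    individuals = {name for pair in observed for name in pair}
    dominates = set(observed)
    # Warshall-style closure: if a outranks k and k outranks b, infer a outranks b.
    for k in individuals:
        for a, b in product(individuals, repeat=2):
            if (a, k) in dominates and (k, b) in dominates:
                dominates.add((a, b))
    return dominates


# Only adjacent-rank interactions are observed directly.
observed = [("A", "B"), ("B", "C"), ("C", "D")]
inferred = infer_dominance(observed)

# Order individuals by how many others they are inferred to outrank.
ranking = sorted(
    {name for pair in observed for name in pair},
    key=lambda name: -sum(1 for dom, _ in inferred if dom == name),
)
```

In this toy case, the relation between A and D is never observed directly but follows from the chain of adjacent-rank interactions. Contradictory observations (for example, a rank reversal) would produce pairs in both directions, which the sketch does not attempt to resolve.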


4.2 Kinship and friendship


Primate social life is not just about social rank but also about social 
relationships. In an early study, long-tailed macaques were trained to
differentiate mother-offspring pairs from other types of social dyads. In
subsequent transfer trials, subjects were able to identify untrained kin dyads,
suggesting a concept of kinship (Dasser, 1988). Other social concepts in 
primates refer to what has loosely been called ‘friendship’ (measured as 
social bonds, e.g. Silk et al., 2010), which in primates can be maintained
over long time periods, sometimes between the sexes and independent 
of kin relations. The ability to form social bonds with nonrelatives may 
thus be one of the most important features of primate sociality, which has 
also been linked with the evolution of large brains (Dunbar and Shultz, 
2007). According to this ‘social brain’ hypothesis, forming, maintaining 
and monitoring others’ social bonds is computationally very demanding, 
suggesting that large brains evolved in response to the cognitive demands 
of sociality (Dunbar, 1993). However, this focus has come under increas- 
ing scrutiny, due to energetic constraints in maintaining brain tissue (Isler 
and Van Schaik, 2014). 

Although social bonds are apparent from patterns of social interac- 
tions, there is not much direct evidence that primates represent others’ 
social relations as distinct concepts (e.g. friend vs. foe). In an early field 
experiment, Kummer et al. (1974) investigated the mechanisms of bond 
formation between male and female hamadryas baboons. In one condi- 
tion, an observer male was allowed to watch how another male from the 
same troop interacted with a new female before being admitted to the pair. 
The striking finding was that the observer male respected the new pair 
bond, even if he was dominant over the rival. This respect of “ownership,” 
however, was not generalizable and did not transfer to situations where two
males competed over access to food. Using playback experiments, Wittig 
et al. (2014) recently showed that, even hours after a natural aggressive 
interaction had occurred, chimpanzees were still strongly affected if they 
heard the aggressive “waa” barks of a “friend” of the former aggressor, 
which was not the case if they heard the “waa” barks of other group
members. Chimpanzees, in other words, are able to recruit memories of
past social interactions from different sources to make inferences about
current interactions. 


4.3 Group 


Another potential social concept concerns group membership. Almost all 
higher primates form groups that are defined by individualised member- 
ships, but do they also categorise others in terms of in- and out-group 
members? In one field experiment, chimpanzees reacted more strongly to the
calls of out-group members than to those of their own group members (Herbinger
et al., 2009), while males appear to use the vocal behaviour of neighbouring
males to assess their current party size numerically (Wilson et al., 2001).
Again, these results indicate that primates can distinguish in- from out- 
group members but this is not direct evidence of a corresponding concept 
that could generalise to novel situations. 

In sum, a large literature suggests that non-human primates
structure the real world along mental concepts akin to those of humans,
such as rank, kinship, friendship and so on. In most cases, however, the 
existence of these concepts is only vaguely inferred by patterns of behaviour. 
Hence, although behaviour patterns are indicative that primates categorise 
others in terms of friends, rivals, relatives, there is no real direct evidence 
that this is based on corresponding mental concepts. Related to this, there 
is no good evidence that primates possess vocal labels for any of these
putative concepts, as is the case in human language.


5. Intentionality³
5.1 Audience effects 


There is little doubt that nonhuman primates, and probably many other 
animals, can extract useful information from signals, but this in itself is per- 
haps not so interesting. As Tomasello (2008, p. 19) puts it: “...the monkey 
has simply learned that one thing predicts another, or even causes another, 
in the same basic way as many other phenomena in their daily lives.” But 


3 Based on material published by K. Zuberbühler and J.C. Gomez,
‘Communication, Primate Intentional’, in C. Power (Ed.), International Encyclopedia of
Anthropology (Wiley, in press). 
when humans extract meaning from speech signals, this usually goes beyond
the utterance’s (literal) linguistic meaning. Instead, semantic inferences are 
made from shared experiences with the signaller, the prior interaction histo- 
ry and, crucially, the speaker’s intended meaning. Human communication, 
thus, operates not just on what is signalled, but also on what the signaller 
intends to communicate (Grice, 1969). Recipients, in turn, assume that 
signallers intend to say something that is relevant for them, which requires 
higher levels of intentionality than simple goal directedness. 

The ability to ascribe intentions is based on the cognitive ability to
attribute mental states to others, which raises possibly the most important
question in comparative research: do nonhuman primates base their
communication on the mental states of others? Do they perceive others as
independent minds with their own intentions, beliefs, and knowledge? There is
some evidence that primates can go beyond simple signal-response arith- 
metic and consider social factors when producing and responding to each 
other’s signals. Chimpanzees, for example, time the delivery of some of their 
calls to a partner’s attention, and inhibit call production in the presence of 
some audiences (Hostetter et al., 2001). For chimpanzee greeting signals, 
the pant grunts, vocal production is often inhibited by the presence of
top-ranking bystanders, demonstrating that call production is audience
dependent (Laporte and Zuberbühler, 2010). Similarly, victims of aggres- 
sion can alter the acoustic structure of screams, not only depending on the 
severity of aggression experienced, but also in relation to the composition 
of the nearby audience. Victims of severe attacks tend to exaggerate the 
aggression experienced, but only if at least one high-ranking listener is in 
the audience who could interfere (Slocombe and Zuberbühler, 2007). Other
evidence comes from females, who are more likely to give copulation calls if
high-ranking males are in the vicinity and less likely if other females are nearby
(Townsend et al., 2008). Finally, when finding food, chimpanzees produce
acoustically distinct rough grunts which attract other group members to 
the site, but again these calls are especially common if subjects travel with 
social allies or high-ranking group members (Slocombe et al., 2010; Schel 
et al., 2013a). In monkeys, audience-aware vocal communication has been 
found in predation contexts (Wich and de Vries, 2006; Papworth et al., 
2008) and during social interactions (Semple et al., 2009) and it is very 
well documented in great ape gestural communication (e.g. Cartmill and
Byrne, 2007). 

In sum, nonhuman primate signalling can be affected by the presence of 
others and the presumed social consequences of signal production, most 
likely because they perceive others as intentional beings. Whether or not 
the primates perceive their audiences as possessing more complex mental 
states, such as knowledge or beliefs, is not addressed by these studies. 


5.2 Mental state attribution 


Most theories of animal communication grant primate signals an im- 
perative function, whereas human communication, from an early age, 
appears to have additional declarative, informative or interrogative 
functions. This is manifested by the fact that, from a young age, human 
infants point to objects, not because they want them, but with the sole 
aim of sharing attention or information with others. Apes, in contrast, 
typically use gestures with the only purpose of requesting or directing 
others to objects or activities. For example, if apes indicate hidden tools 
or food to a human, this act is performed with the final aim of obtain- 
ing food, not just to share a target of common attention (Roberts et al., 
2014). 

A key component in any theory of language evolution thus has to do with 
how aware signallers are of the mental states of their recipients during acts 
of communication. In humans, communication is the product of intention, 
both in the sense of goal-directedness and aboutness (Brentano, 1874). Do 
primates perceive others as governed by mental states during acts of com- 
munication? If so, how complex are these mental states? Are they able and 
motivated to convey knowledge relevant to their audience? 

An early approach to study the problem was to raise great apes in an 
environment entirely structured by humans in order to document the de- 
velopment of their communication skills. One famous case study was a 
chimpanzee called ‘Sarah’, who acquired over 100 arbitrary sign-meaning
linkages, which enabled her to generate sentence-like strings up to eight
units long (Premack, 1970). Similarly, two chimpanzees, ‘Sherman’ and
‘Austin’, learned to use geometric symbols (lexigrams) to make requests 
to one another (Savage-Rumbaugh et al., 1980), while ‘Kanzi’, a male
bonobo, learned to master a lexigram system of over 300 arbitrary symbols
(Savage-Rumbaugh and Lewin, 1994). A related approach was to teach 
apes American Sign Language, which resulted in repertoires of over 100 
signals of the major grammatical classes. Despite these findings, the apes 
did not make much use of their acquired skills during natural social inter- 
actions and there was no good evidence that subjects demonstrated a deep 
understanding of the symbolic nature and basic grammatical rules of their 
acquired language systems (Terrace et al., 1979). 


5.3 Levels of intentionality 


Intentionality derives from Latin intendere, i.e. being directed towards a 
goal or thing (Jacob, 2014). However, people’s psychological states are not 
always about such entities, but sometimes also about other individuals’ psy- 
chological states, which makes it necessary to distinguish between different 
levels of intentionality (Jacob, 2014). Whether or not animals (or young 
human infants) can also attribute psychological states with intentionality 
to others has thus been the topic of much empirical research, usually under 
the notion of ‘theory of mind’ (Premack and Woodruff, 1978; Tomasello, 
2014). For example, apes use a rich repertoire of gestures when interacting 
with social partners, with evidence of audience awareness, social directed- 
ness, and communicative persistence (Call and Tomasello, 2007), similar 
to what has been found in pre-linguistic human infants. For vocal signals, 
a relevant finding emerged from free-ranging vervet monkeys where it was 
shown that playbacks of acoustically distinct alarm calls given to eagles, 
leopards, and pythons were enough to make others respond in adaptive 
ways, even if no actual predator was present. For instance, after hearing a 
snake alarm, monkeys responded by bipedally scanning the surrounding 
area, as if trying to locate the putative snake (Cheney and Seyfarth, 1992). 
A philosophical approach to the problem of differing levels of intentionality 
has been proposed by Dennett (1983), and called the ‘intentional stance’. 
This framework has produced landmark progress in assessing the question 
of intentionality in ethological data and relating it to intentionality in hu- 
man communication (Table 1). 
Table 1: Dennett’s (1983) recursive grades of intentionality in animal communication

Intention    Content
0 order      A recognizes x
1st order    A wants B to do x
2nd order    A wants B to recognize x
3rd order    A wants B to recognize that A wants B to do x
4th order    A wants B to recognize that A wants B to recognize x
5th order


For the vervet monkeys, the proposal then was that, when a monkey pro- 
duces an eagle alarm call in response to which other group members run 
into cover, this may be the result of different ‘grades’ of intentionality. The 
null hypothesis, or 0-order intentionality, attributes no intentionality at all:
callers merely react automatically to the perceptual recognition of each
type of predator. They are simply “...prone to three flavours of anxiety or 
arousal: leopard anxiety, eagle anxiety, and snake anxiety”. Each anxiety 
has an evolved link to one call type, and listeners may form associations be- 
tween external events and call types that allow them to react appropriately. 
Signallers and recipients benefit in their own ways, but are not mentally 
connected, in a sort of “by-product semanticity”. 

First-order intentionality is different in that the monkey produces alarm 
calls with the goal to influence the behaviour of others (Table 1). Although 
there was no direct evidence for first-order intentionality in the original
studies with vervet monkeys (Seyfarth et al., 1980), other work has suggested
that this is within the cognitive capacities of non-human primates. For 
example, female Diana monkeys continue to produce alarm calls, until the 
group’s single adult male also produces the same alarm calls, i.e. the alarm 
call type that corresponds to the predator the females have perceived, as 
if they are soliciting the male’s acknowledgment of the situation (Stephan 
and Zuberbühler, 2016). These results indicate that primate alarm calls are 
not just automatic and direct responses to an external disturbance, but that 
calling is additionally governed by the social interaction history between 
signallers and recipients, and may require some level of intentionality in 
Dennett’s scale. More specifically, first-order intentionality requires that 
signals are produced to cause a specific and deliberate effect in an
audience. For this to be the case, the signal needs to be socially addressed, the
recipient’s response should be monitored and, if the desired outcome is not 
achieved, signalling should be modified accordingly (Bruner, 1981). 

There is good evidence that primates deploy some of their signals in such 
ways, particularly in the realm of gestural communication. For example, 
bonobos produce beckoning gestures with persistence and signal elabora- 
tion to persuade a sexual partner to follow to a distant location (Genty 
and Zuberbühler, 2014). This type of goal-directed intentionality is com- 
mon in ape gestural communication but also visible in facial signals, such 
as chimpanzee lip-smacking during grooming, where it appears to cause 
longer and more reciprocal grooming bouts, probably by expressing benign 
intention (Fedurek et al., 2015). 

While primate gestural communication appears to qualify as the result
of first-order intentionality, this is less clear for vocal signals. Vocal signals
do not require visual contact between partners so it is not always clear 
whether or whom the caller wishes to address. Also, as many calls are 
given in relatively rigid context-specific ways, it is less plausible that they 
are active attempts to alter a recipient’s behaviour (let alone mental states), 
compared to being mere vocal tags of specific external events. At the same 
time, several studies have shown that primates can use vocalisations in goal- 
directed ways, specifically to influence a recipient in some way or another. 
In one key study, chimpanzees were more likely to produce alarm calls to 
snakes if their audience was unaware of the danger, suggesting some active 
assessments of others’ mental states (Crockford et al., 2012). Similarly, 
chimpanzees prior to travel sometimes give specific “travel hoos” in con- 
nection with complex departure behaviour that includes audience checking 
and other signs of goal-directedness (Gruber and Zuberbühler, 2013). 

A more complex sense of intentional communication is displayed if both 
signallers and recipients appear to take into account each other’s mental 
states when producing and responding to signals. In this situation, a sig- 
naller not only ‘wants’ a recipient to do something specific, but she also 
wants to be understood. For primate communication, there is no good 
evidence for such second-order intentionality, a controversy directly linked
to whether primates can attribute mental states to others, that is, whether 
they have a ‘theory of mind’ (Zuberbühler and Gomez, in press). In captiv- 
ity, chimpanzees have demonstrated mental state attribution, first in terms 
of judgments about what others can or cannot see (Hare et al., 2000) and,
more recently, what they can and cannot know (Krupenye et al., 2016), 
but these capacities have not manifested themselves in their communication 
behaviour, with a few exceptions. As mentioned before, in more natural 
situations when encountering snakes, chimpanzees direct their alarm calls 
to arriving group members, not to the snake, and call more if the arriving 
individuals are socially close to them and ignorant about the danger (Schel
et al., 2013b; Crockford et al., 2012). Equally relevant are studies on ape
gestures. For instance, bonobos were more likely to repeat their gestures to a
reluctant familiar keeper than to an unfamiliar one, whereas with an unfamiliar
keeper they were more likely to change gestures, as if taking into account the
differences in the keepers’ knowledge rather than just perceptions (Genty et
al., 2015).

In sum, some of the primate gestural literature is in line with the
hypothesis that second-order intentional states can guide primate communication,
although it is usually possible to propose simpler explanatory models based 
on behaviour-reading and associative learning mechanisms, as a string of 
associations between behaviour cues and event outcomes. 


5.4 Ostension 


Zuberbühler and Gomez (in press) argue that there is another sense of
intentional communication, a ‘Mitteilungsbedürfnis’, which is different because
it refers to others’ mental states in a particular way (Fitch, 2010). Here, 
information transfer is not just by linguistic coding and decoding but also 
by producing ostensive indicators of meaning, which recipients use to infer 
that the sender is trying to convey a message (Grice, 1957). Humans often 
use finger pointing to this effect, something that nonhuman primates do 
not do. However, in theory almost any behaviour can serve as an ostensive 
signal, including gaze fixation, which is more common in primates
(Zuberbühler, 2008). Ostension, in its simplest form, may be little more than
‘addressing’ behaviours, e.g., calling or claiming attention, something that 
has been repeatedly described in great apes. 


5.5 Shared intentionality


A recent theory of human evolution suggests that humans are special in their 
ability to share goals and intentions when participating in collaborative 
activities (Tomasello et al., 2005). According to this line of argument, sub- 
jects require not only powerful forms of mindreading but also a profound 
motivation to share their mental states with others to enter some kind of 
shared cognitive representation of joint intentions. There is some evidence 
that primates can engage in joint activities, which require that the intentions 
match (Pika and Zuberbühler, 2008), but whether this is equivalent to a 
sense of sharing cannot be decided. According to this theory, the implica- 
tions of being able to share intentions are enormous, enabling subjects to 
create linguistic conventions, social norms, social institutions and many 
other types of human activities (Heesen et al., in press). 


5.6 Conclusions 


There is compelling evidence that primates use both vocal and gestural 
signals in a goal-directed, first-order intentionality sense. Some further evi- 
dence suggests that there might also be second-order intentionality, but it 
remains to be determined in what sense primates attribute mental states to 
others and whether they are able to communicate about this. Primates show 
intentions to communicate in an ostensive sense, if one accepts evidence of 
addressing and seeking others’ attention as a demonstration of an osten- 
sive function. Finally, there is no good evidence of shared intentionality. 
Although great apes understand the basics of intentional action, and may 
use communicative signals to affect others’ intentions, they do not appear 
to experience a sense of ‘shared’ intentionality.


6. Referential communication⁴
6.1 Aboutness 


One philosophical definition of intentionality is in the sense of aboutness— 
as being about or directed at a particular object (Brentano, 1874). In this


4 Based on material published by ibid. and K. Zuberbühler and C. Neumann, 
‘Referential Communication in Nonhuman Animals’, in J. Call (Ed.), Apa Hand- 
view, a signal is intentional if it is emitted or understood as being about an
object (Zuberbühler and Gomez, in press). In the vervet monkey example, 
recipients may interpret an alarm call as indicating, not just that there is 
a predator, but that a sender has found a predator and that the alarm call 
is about this predator. In humans, aboutness goes further, in that humans can think
about non-existing things or situations, including reflecting on others’ false 
beliefs. There is no good evidence that non-human animals are able to 
engage in such mental operations. Nevertheless, some findings are relevant 
and suggest, at least, some precursor abilities. For example, in playback 
experiments, free-ranging baboons recognised when a call was directed 
at themselves as opposed to at other individuals (Engh et al., 2006). In 
chimpanzees, victims of aggression retreated from playback of aggressive 
barks given by an ally of the former opponent, but ignored the same barks 
if given by other group members, even hours after the conflict, again sug- 
gesting that they understood that the barks were about them (Wittig et al., 
2014). In sum, when attending to vocal signals, baboons, chimpanzees, and 
probably other primates take into account the target of others’ attention, 
but mostly if this is about themselves. 


6.2 Functional reference 


The vervet monkey experiments discussed earlier, together with more re- 
cent studies, amount to a considerable literature on what is often called 
‘functionally referential’ communication. Empirical evidence is usually in
the form of structurally unique signals given to identifiable events that are 
external to the signaller. ‘Functional reference’ has become one of the most 
influential models of animal communication, mainly because it is conceptu- 
ally simple and makes predictions that are easily quantifiable. According 
to the original definition (Macedonia and Evans, 1993), a functionally 
referential signal should (a) exhibit a degree of stimulus specificity and (b) 
be sufficient to allow receivers to select appropriate responses, even in the 
absence of the eliciting stimulus and other normally available cues. Vervet 
monkey alarm calling was said to fulfil both criteria and has thus become 


book of Comparative Psychology (Vol. 1. Basic Concepts, Methods, Neural 
Substrate, and Behavior: APA, 2017). 
the textbook example of functionally referential signalling in non-human
animals (Macedonia and Evans, 1993). 

More recent fieldwork has shown that the main carriers of meaning are 
not always the individual calls, but sometimes sequences or combinations 
of calls (Arnold and Zuberbühler, 2006, 2008).
Here, the evidence is not restricted to primates, but also includes studies 
on birds (Engesser et al., 2015, 2016). In addition, in some communication 
systems distinct call types can appear as call variants or compound calls, 
which are given to subtle changes in external events. For example, male 
Campbell’s monkeys produce acoustically distinct alarm calls to crowned 
eagles (‘hok’) and leopards (‘krak’), but add an acoustically distinct ‘oo’ 
suffix if the danger posed by these two predators is not imminent (i.e.,
‘hok-oo’, ‘krak-oo’, respectively) (Ouattara et al., 2009b), an acoustic vari- 
ation that is perceived and discriminated by recipients (Coye et al., 2015). 
Moreover, the different call types can be concatenated into long call se- 
quences, which are given in context-specific ways to predation events and 
various non-predatory disturbances (Ouattara et al., 2009a), some of which 
are perceived and discriminated by recipients (Zuberbühler, 2002). 

These findings have led to an important theoretical discussion, namely 
whether animal signals are really equivalent to human referential
communication or just a by-product of evolved behaviour that grants the
signaller a fitness payoff. One problem is what exactly qualifies as an external
event under the ‘stimulus specificity’ criterion. Detecting a predator or a novel
food source are relatively obvious examples, but what about calls given 
when encountering another group member or calls given during mating? 
Chimpanzees, as mentioned, sometimes produce acoustically distinct vo- 
calisations when encountering higher-ranking group members, a reliable 
indicator of the type of social interaction the caller is engaged in. Although 
these calls refer to relatively specific social events, it is not clear whether 
they should qualify as functionally referential, mainly because the event 
(stimulus) is not strictly ‘external’ to the caller. 

Another problem with ‘functional reference’ is that animals often pro- 
duce alarm calls to a range of events that, in human terms, cannot be sub- 
sumed in a common category. For example, monkeys may give the same 
alarm call type to a range of events, which are represented by different 
mental concepts in humans. Alarm calls to terrestrial disturbances are often
particularly broad. Interestingly, in response to ‘terrestrial alarms’ other
monkeys often first look in the direction of the caller, most likely to obtain 
additional behavioural information, such as gaze direction. Also, monkeys 
often take the more general context into account, when responding to 
alarm calls. For example, in response to playbacks of ‘terrestrial alarms’, 
putty-nosed monkeys spend less time looking at the caller if the alarms 
were preceded by additional information that disambiguates the cause of 
the call, such as the loud thundering sound of a falling tree (Arnold and 
Zuberbühler, 2013). 

The deeper problem is that there are virtually no empirical studies on 
whether or how wild animals organise their worlds in mental concepts, as 
discussed in a previous section. Until there are data to systematically de- 
scribe how wild animals categorise and represent the real world mentally, 
the production specificity criterion of functional reference will remain a 
contentious notion. For example, it is possible that animal vocalisations 
are not the direct product of specific mental representations, triggered by 
specific external events, but are mere indicators of behavioural intentions. 
Diana monkey eagle alarm calls, for instance, may not have evolved to 
warn others about eagle presence, but to indicate that the caller is about to 
engage with a perched eagle (Stephan and Zuberbühler, 2016). Again, fu- 
ture research should address this possibility with targeted field experiments. 


6.3 Psychological reference 


The term functional reference has also been created specifically to avoid 
discussions about underlying intention, whether animals deliver signals 
in intentional ways. The fact that most animal signals are part of a rela- 
tively hard-wired species-specific repertoire is not necessarily a problem 
for questions about referential signalling. One of the most powerful refer- 
ential signals in humans is pointing, a non-linguistic, species-specific, in- 
nate signal used to direct a recipient towards a shared experience. Human 
signallers actively seek to point out relevance to their recipients, in order 
to share common ground. Although vervet monkey alarm calls inform 
recipients about specific external events, it is unclear whether these signals 
are produced with the intention to do so, as explained earlier. Human-like 
referential communication goes beyond mere audience awareness and
goal-directedness, as discussed before, and usually includes some assessment of
others’ mental states — their perceptions, intentions, knowledge or beliefs. 
Whether animals can take their audiences into account at this level is
controversial. The evidence for referencing as a psychological act is currently
best for wild chimpanzees, particularly from a series of field experiments
involving the detection of snake models. As mentioned, alert call production
was strongly determined by the presence of friends but also by whether recipients
were ignorant or aware of the snake (Crockford et al., 2012, 2015; Schel et al.,
2013b). Although more targeted experiments are needed, results tentatively 
suggest that chimpanzees can take the knowledge state of a receiver into 
account, as opposed to mere perceptions, when referring others to a relevant 
external event. 


7. Communication as social bonding⁵


The biological success of our species is partly grounded in a major evo- 
lutionary transition in mental capacities from self-serving, competitive to 
group-oriented, cooperative social relationships. Compared to other pri- 
mates, humans are much more collaborative, prosocial, and amenable to 
social norms, which has far-reaching implications at almost every level 
of human activity, including language (Tomasello, 2014). Human social 
interactions are highly structured joint activities during which partners 
are in tune with each other’s intentions -- the human ‘interaction engine’ 
(Levinson, 2006). Linguistic discourse is a specific type of joint activity, 
which is characterized by rapid exchanges of syntactic units, while its se- 
mantic content is often about others’ social behaviour, or gossip, and this 
exchange of social information appears to have a strong bonding effect on 
the interlocutors (Dunbar, 1993). 

5 Based on material published by K. Zuberbühler and P. Fedurek, ‘Vocal Grooming’,
in Todd K. Shackelford and Viviana A. Weekes-Shackelford (Eds.), Encyclopedia
of Evolutionary Psychological Science (Berlin: Springer, in press).

In non-human primates, social bonding is mainly achieved by manual
grooming. In wild chimpanzees, for example, urinary oxytocin levels are
higher after grooming with bond partners than with other group members
(Crockford et al., 2013) and there is also good evidence that grooming has a direct
impact on an individual’s fitness. In primates, grooming increases the probability
that partners will help each other during fights, whereas individuals
with strong social networks have higher life-time reproductive success than 
less socially integrated individuals (Silk et al., 2010). Across species, there is 
a positive relation between social grooming time and group size, suggesting 
that living in large groups leads to relatively more time demands for manual 
grooming, most likely due to an exponentially increased number of social 
relationships. This has led to the hypothesis that, during human evolution, 
manual grooming has been gradually replaced by vocal grooming as the main 
mechanism for social bonding, which has paved the way for the evolution 
of language (Dunbar, 1993). Vocal grooming, in other words, has become 
something like the missing link in this theory of language evolution. 

How did the transition from manual to vocal grooming take place? One 
candidate for vocal grooming in primates is chorusing. In chimpanzees, 
males of the same group chorus with their pant hoots, an acoustically 
unique long-distance vocalization. There is evidence that these joint vocal 
displays facilitate tolerance at feeding sites and predict support in agonistic 
interactions on a short-term basis, while manual grooming seems to be a 
better predictor of long-term social bonds (Fedurek et al., 2013). Another 
candidate is vocal convergence, usually seen between affiliated individuals 
that adjust the acoustic structure of their calls during vocal interactions 
(Candiotti et al., 2012). In humans, vocal convergence is a well-documented 
phenomenon, described as ‘speech accommodation’, although similar ef- 
fects are also seen in gestures, suggesting that people try to either emphasise 
or minimise the social difference to their interlocutors, reflecting their desire 
or refusal to strengthen a social bond (Clark and Schaefer, 1989). Another 
candidate that might have paved the way for a transition from manual to 
vocal grooming is call exchanges, as found in many primates, and often 
between closely affiliated individuals. Call exchange networks tend to be 
socially more rigid than grooming networks, insofar as socially important 
individuals are more likely to elicit vocal responses than others. Related 
to call exchanges is duetting, a vocal behaviour seen in gibbons and other 
monogamous primates. Duetting has a presumed bonding function (Geiss- 
mann, 2002; Filippi, 2016), but the behaviour is relatively inflexible in the 
sense that once a relationship is established it simply functions to broadcast 
this fact to potential sexual rivals. Vocal turn-taking is also a key feature
of human linguistic discourse, which is characterized by highly structured,
rapid exchanges of short syntactic units, around 1500 per day (Levinson, 
2016). In humans, the length of turns can be flexible, as well as the number
of speakers, but participants rigidly observe short time gaps of about
200 ms between turns. This highly conserved pattern requires participants
to predict when a partner’s turn comes to an end, which is only possible if 
interlocutors can anticipate the forthcoming semantic content, while con- 
structing their next utterance. This pattern is found across languages, sug- 
gesting that it is based on a biological predisposition, with possibly deep 
evolutionary roots (Levinson, 2016). 

Recent empirical research has suggested another candidate behaviour 
for the transition from manual to vocal grooming. In chimpanzees, manual 
grooming is often accompanied by a specific acoustic signal, lip-smacking, 
especially between closely affiliated individuals. Lip-smacking plays a key 
role in coordinating grooming bouts by making them longer, more recipro- 
cated and more intimate (Fedurek et al., 2015). Many primates lip-smack 
during affiliative social interactions (Bergman, 2013), but so far there is only 
evidence in chimpanzees for a direct link between this behaviour and man- 
ual grooming (Zuberbühler and Fedurek, 2017). Chimpanzee lip-smacking 
has thus become something of a prime candidate for the missing link to
human vocal interactions. While chimpanzee lip-smacking is semantically
empty, it similarly seems to function in social bonding and is delivered in
temporally structured ways. One evolutionary scenario therefore is that
lip-smacking may have served as a precursor for the evolution of articu- 
latory control (Ghazanfar et al., 2012). As argued earlier, there is good 
evidence that great apes have substantial control over their supra-laryngeal 
vocal tract but relatively little active control over the larynx (Lameira et 
al., 2014). The transition from manual to vocal grooming, in other words, 
may have been linked with gaining active control over sound production 
in line with concurrent underlying changes in the cognitive architecture. 


8. General conclusions 


Human communication is strikingly different from any other known natural 
communication system. From an evolutionary perspective, this is particularly
striking because, biologically, humans are primates whose communication
system has evolved during a long and shared phylogenetic history.
One way to investigate the roots of human language is with comparative 
studies of primate cognition, particularly the basic processes required for 
language production and perception. 

Humans are clearly unique in the amount of control they have over 
sound production, and this has considerable effects on other important 
features of language, such as vocal imitation, phonemic repertoires and the 
ability to produce syntactic structures. Humans are the only primate species 
capable of controlling and socially learning their vocal output, which as a 
consequence becomes part of shared communicative conventions. Although 
non-human primates and most other animals do not have active control 
over their vocal output, some studies suggest that they can use parts of their 
signal repertoire in referential ways, to inform others about relevant events 
in their environment. Current evidence also suggests that the primate vocal 
tract is essentially speech-ready, and that great apes have good voluntary 
control over most speech articulators, apart from the larynx. Why and how 
humans have evolved the ability to control the larynx is an open question, 
but it has been suggested that this has evolved in the context of cooperative 
breeding (Zuberbühler, 2011). 

Language is also a cognitive system based on conceptual thought and the 
ability to categorise the world in mental concepts. Here, the evidence for 
equivalent structures in nonhuman primates is not strong, although many of
their behavioural patterns suggest similar representational organisation. A
key feature of any language definition is that, during acts of communication, 
signallers draw their recipients’ attention to what they consider relevant 
entities, whether real or imagined. Humans routinely refer others to what they
perceive as relevant aspects of the world, a psychological propensity that 
was probably one of the major drivers of language evolution. Linguistically, 
this is achieved with arbitrary acoustic conventions, but referring can also 
happen non-linguistically, for example with iconic gestures or pointing. In 
humans, this ability emerges early during development, mainly in the form 
of pointing that functions, similar to linguistic expressions, as a referen- 
tial signal. An important evolutionary question therefore is whether this 
characteristically human way of communicating is also present in some 
animal signalling and, if so, whether the underlying cognitive processes 
are similar. While in humans referential communication appears to be the
result of mental state attributions, in the sense that signallers communicate
content relevant to their audience, only very few studies in animals have 
found evidence of this kind, all of them in wild chimpanzees communicating 
about food or danger. In human communication, in contrast, there seem 
to be no such contextual limitations, a possible consequence of our highly 
cooperative nature.


What Gestures of Nonhuman Primates Can
(and Cannot) Tell Us about Language Evolution


Abstract: There is a variety of different evolutionary scenarios hypothesizing how 
human language might have evolved. While some suggest that language evolved 
from scratch in humans only, others propose that precursors to human language 
were already present in our shared last common ancestor. Consequently, compara- 
tive researchers suggest that at least some of the abilities necessary for language to 
evolve are shared with other primates. However, which aspects of primate com- 
munication are studied to shed light on language evolution heavily depends on 
which communicative modality is studied. This chapter focuses on the gestural 
communication of nonhuman primates. The aim is to contribute to the ongoing debate
about how language might have evolved by evaluating findings from comparative 
research on the different building blocks of language, and by discussing how these 
data support a gestural origin of human language. 


Keywords: nonhuman primate communication, gestural communication, language 
evolution 


1. Introduction 


Many theories of language evolution are based on comparative evidence of 
the communicative abilities of our closest relatives, the nonhuman primates. 
In searching for the roots of human language, researchers therefore aim at 
identifying potential precursors to language in other primate species. The 
aims of this chapter are, first, to give an overview of the building blocks 
of language, which are usually studied by researchers proposing a gestural 
origin of human language, such as intentional and flexible use of gestures, 
mechanisms of gesture acquisition, and the potential for rule-governed, 
meaningful combinations. Second, for each of these characteristics, the 
current evidence from gestural communication of apes and monkeys is 
evaluated to answer the question of whether these findings support a gestural
origin of language. The chapter will close by suggesting some future directions for
research into primate gestural communication necessary to inform theories
of language evolution in more comprehensive ways. 


2. Different approaches to language evolution 


Although there is general agreement that language is unique to the human 
species, there is much debate about whether language is fundamentally dif- 
ferent from other animals’ communication systems. Thus, some scholars 
propose that language evolved from scratch in humans only and that we 
can therefore not gain any knowledge about language evolution by compar- 
ing human language to the communicative systems of other, even closely 
related species (Hauser et al., 2014). Animal communication is suggested 
to be fundamentally different, both with regard to its structural proper- 
ties as well as cognitive foundations (Bickerton, 1992; Chomsky, 1966; 
Scott-Phillips, 2015) and comparisons should be limited — if conducted at 
all — to comparing the nonverbal “gesture-call systems” of humans to that 
of nonhuman primates (Burling, 1993). 

Others argue that it is highly unlikely that a trait as complex as language 
evolved in such a short time in the human lineage only (Pinker and Bloom, 
1990) and that language built on traits already present in our shared com- 
mon ancestor, including neurobiological substrates (Arbib, 2005, 2016), 
anatomical structures such as the vocal apparatus (Fitch et al., 2016; Riede 
et al., 2005), or cognitive skills (Seyfarth et al., 2005). Therefore, propo- 
nents of this continuity approach suggest that a comparative approach, 
which investigates the communicative abilities of nonhuman primates, is 
useful to identify potential precursors to human language in our closest 
relatives (King, 1999; Tomasello, 2008; Zuberbühler, 2005). 

It is important to highlight that the answer to the question whether 
proposed differences between human language and animal communication 
are a matter of degree or kind heavily depends on how language is defined. 
However, a universally accepted definition, which covers the many differ- 
ent facets of language, has not been achieved yet. Hauser and colleagues 
(2002) suggested that for studying how language might have evolved, “... 
it is unproductive to discuss language as an unanalyzed whole” (see also 
Fitch, 2005, p. 194). Rather, it is important to decompose language into the
many different mechanisms necessary for it to emerge, and to differentiate
between more general mechanisms involved in language processing,
which are potentially shared with other species (“faculty of language in the 
broad sense”) compared to those that are specific to language and uniquely 
human (“faculty of language in the narrow sense”) (Hauser et al., 2002). 

Importantly, which aspects of primate communication are studied to 
identify those abilities shared across primates, and to differentiate them 
from those unique to humans, heavily depends on whether researchers 
are interested in vocal, gestural, or facial communication. Thus, the over- 
whelming majority of existing research into primate communication uses 
a unimodal approach, to study one specific modality in isolation, while 
ignoring other modes of communication (Slocombe et al., 2011). Therefore, 
in searching for the origins of human language, a major debate centers on 
the question of which communicative modality provided the starting point for
language to emerge, with scholars defending a gestural, vocal, or orofacial 
origin. Proponents of each origin usually focus on one specific modality, 
and use their findings to argue why the corresponding other modalities 
are not suitable candidates to explain how language might have emerged 
(Slocombe et al., 2011). 

The focus of this chapter is on primates’ gestural communication, but it 
is important to note that depending on the type of communicative modality 
studied, different aspects of primate communication are investigated to shed 
light on the origins of human language. Therefore, the following section
introduces those aspects of primate gestural communication that are
usually studied to find evidence for a gestural origin of human language.


3. Which aspects of primate gestural communication are 
studied to find evidence for a gestural origin of human 
language? 


The first of the modern gestural theories, which heavily influenced other
theories suggesting a gestural origin, was the Gestural Primacy Hypothesis
postulated by Hewes (1973). In general, hypotheses proposing a gestural 
origin of human language assume that spoken language was preceded by 
a gestural stage and that our ancestors therefore initially communicated 
by using voluntarily produced manual gestures (Armstrong et al., 1995; 
Corballis, 2002; Hewes, 1973; Tomasello, 2008).

Since intentional production is key to human language, comparative
researchers supporting gestural scenarios of language evolution focus on 
gestures of nonhuman primates, because primates use them intentionally 
in a purposeful, goal-directed way and are therefore able to voluntarily 
control their production. Gesture researchers usually argue that in contrast
to gestures, the majority of vocalizations and facial expressions are invol- 
untary expressions of internal emotional states, and that it therefore seems 
unlikely that language has emerged from these communicative modalities 
(Tomasello, 2008). A second heavily investigated aspect of primate gestural 
communication is the usage of gestures across different social contexts, 
representing a marker for the flexibility of this modality. Thus, gesture 
research is interested in those gestures, which are flexibly used across dif- 
ferent contexts to achieve different social goals (Arbib et al., 2008; Call and 
Tomasello, 2007), in contrast to vocal research, which focuses mostly on 
context-specific vocalizations, such as food-related or predator-specific calls 
(Kalan et al., 2015; Murphy et al., 2013; Schel et al., 2013a; Zuberbühler, 
2001). These different research foci result in different conclusions across 
modalities: because gesture researchers focus on the flexibility of gesture
usage, they conclude that gestures have no inherent meaning, since the in- 
formation they convey is defined by the context in which they are used (Call 
and Tomasello, 2007). On the other hand, the focus of vocal researchers on 
context-specific vocalizations leads to the conclusion that most vocalizations 
have specific meanings (Seyfarth et al., 1980; Slocombe and Zuberbühler, 
2005; Zuberbühler, 2000), and that they can be combined into meaningful 
sequences (Ouattara et al., 2009; Zuberbühler, 2002). Therefore, unlike 
vocal researchers, gesture researchers are less interested in the questions
of whether primate gestures have specific meanings and whether they are combined into
meaningful sequences. However, there is an increasing number of studies 
also addressing these issues (Cartmill and Byrne, 2010; Hobaiter and Byrne, 
2014; Liebal et al., 2004a). Finally, gesture researchers study the mecha- 
nisms of gesture acquisition during ontogeny, to investigate when and how 
gestures emerge and whether gestural repertoires vary between individuals 
(Arbib, 2016; Liebal et al., submitted). It seems that primates can create new 
gestures and incorporate them into their repertoires (Goodall, 1986). They 
are therefore able to increase the number of gesture types they are using, 
resulting in more open gestural repertoires. Vocal or facial repertoires, on
the other hand, seem to be closed and species-specific, since primates are
not able to acquire new, additional signals (Owren et al., 1992). 

In sum, to find evidence for a gestural origin of human language, gesture 
researchers focus on the intentional production and flexible usage of ges- 
tures, they investigate if gestures have specific meanings and are combined 
into meaningful sequences, and how they emerge in ontogeny. Before the 
remainder of the chapter evaluates the current evidence for each of these different
features of primate gestural communication, the next two sections briefly 
describe how the term gesture is defined, and how the field of gesture re- 
search has developed over the last decades. 


4. What is a gesture? 


Primate gestures are considered voluntarily produced, purposeful behaviors, 
directed at specific individuals to influence their behavior (Benga, 2005). 
Criteria applied to identify such intentional use include that the signaller’s
communicative behavior depends on the presence of an audience, and that gestures
are tailored to the recipient’s behavior, such as the recipient’s attentional state
or response to the signaller’s communicative attempts (Leavens et al., 2005b). Regarding
gesture modality, visual gestures are differentiated from tactile and auditory 
(or audible) gestures. Visual gestures are distant signals and include manual 
gestures, such as “extend arm”, but also body postures, such as “present”. 
Tactile gestures, such as “slap” or “gentle touch”, involve the physical con- 
tact between two interacting partners, but are motorically ineffective. That 
means that if an individual wants another one to follow her, grabbing the 
other’s arm and dragging him along would not count as a gesture, since this 
was not accompanied by waiting for the other’s response to this behavior. 
However, briefly pulling the other’s arm and then waiting for his response 
to follow is a potential gesture, because the recipient has the chance to re- 
spond — or not. Auditory gestures generate a sound, which can be produced 
with different body parts, such as clapping hands or slapping the chest or 
belly (Kalan and Rainey, 2009; Pika, Liebal, and Tomasello, 2003). 

It is important to note that the definition of a gesture and its discrimi- 
nation from other signal types is often confusing. Gestures include visual, 
tactile and auditory signals, and they may therefore overlap with the sensory 
modalities of other signal types. For example, like vocalizations, auditory
gestures are acoustic signals, but they do not engage the vocal cords. Visual
gestures are considered a different signal type than facial expressions (Call 
and Tomasello, 2007; Pollick and de Waal, 2007), although they both rely 
on the visual modality. This indicates that signal types are not necessarily 
classified based on their sensory channels, but that they are differentiated 
based on the traditional dichotomy between voluntarily produced gestures 
in comparison to more reflexive, involuntarily produced vocalizations and 
facial expressions (Liebal et al., 2013b; Tomasello, 2008). 


5. How did the field of gesture research develop? 


Initially, gestural communication was not considered a separate communi- 
cative modality, but gestures were mentioned as components of the general 
behavioral repertoire describing a given species. These very first studies 
focused primarily on primates in their natural habitats, such as baboons 
(Kummer, 1968), gibbons (Carpenter, 1940), orangutans (MacKinnon, 
1974; Rijksen, 1978), gorillas (Schaller, 1963), chimpanzees (Goodall, 
1986) and bonobos (Kuroda, 1980). The first studies on captive apes were 
conducted by van Hooff (1973), who investigated chimpanzees, and de 
Waal (1988), who compared chimpanzees and bonobos. These studies fo- 
cused on establishing gestural repertoires and provided detailed descriptions 
of each gesture and the contexts of their use. 

A more psychological approach, applying definitions derived from de- 
velopmental psychology (Bates et al., 1979; Leavens et al., 2005b), was 
initiated by Michael Tomasello, Josep Call and their colleagues (Tomasello 
et al., 1994; Tomasello et al., 1997; Tomasello et al., 1989). Instead of 
focusing on phylogenetically ritualized signals and their adaptive function 
(Maynard-Smith and Harper, 2003; Smith, 1977), they were interested in 
proximate factors and therefore more cognitive aspects of primate commu- 
nication. In their initial series of studies, they used observational methods to 
study different groups of captive chimpanzees (Tomasello et al., 1985). This 
work was later extended to other primates in captive settings, such as goril- 
las (Pika et al., 2003), bonobos (Pika et al., 2005), Sumatran orang-utans 
(Liebal et al., 2006), and one species of small apes (Liebal et al., 2004c). 

Maestripieri conducted the very first systematic studies on the gestural 
repertoires of captive rhesus, pigtail and stump-tail macaques. Interestingly,
these species differ in their social systems and rank relationships (Maestripieri,
1996a, 1996b, 1997; Maestripieri and Wallen, 1997). A major
conclusion from these studies is that gestural communication varies as a 
function of the social system, with more despotic species using a more 
limited repertoire of gestures, mostly in the contexts of submission and as- 
sertion, while more egalitarian species seem to offer more opportunities for 
negotiating social relationships, evident in the high variety and variability 
of gestures in affiliative contexts (Maestripieri 1999, 2005). 

The first systematic studies on wild populations were conducted by Emily 
Genty and Cat Hobaiter, together with Richard Byrne, who studied the ges- 
tural repertoires and their usage in gorillas (Genty et al., 2009) and chim- 
panzees (Hobaiter and Byrne, 2011a). In line with the results from captive 
research, they found that gorillas and chimpanzees used their gestures
intentionally, in many different contexts, and that they adjusted them to the
recipient’s behavior. However, they found a larger repertoire (> 100 gestures)
than previously reported for captive gorillas (Pika et al., 2003), with limited 
individual variability and large overlap of individual repertoires, even across 
species. They therefore concluded that gestural repertoires are species-specific 
and might even be shared across species, indicating that they are most likely 
biologically determined (Genty et al., 2009; Hobaiter and Byrne, 2011a). 

Experimental settings, which involve communication between a great
ape or monkey and a human experimenter, are frequently used to study
both the production and the perception of gestural signals. The basic
procedure across these studies is that primates are required to request food,
which is either visible or hidden and which they cannot obtain themselves,
from the human. Thus, primates need to use the human as a tool to get the
food. Much attention has therefore been paid to gesture types produced in
this experimental
setting. At least great apes readily learn to use pointing gestures, such as 
whole hand or index finger pointing to request food (Call and Tomasello, 
1994; Leavens and Hopkins, 1999; Leavens et al., 1996). There is increas- 
ing evidence that different monkey species also use pointing gestures (An- 
derson et al., 2007; Kumashiro et al., 2002; Meguerditchian and Vauclair, 
2006; Mitchell and Anderson, 1997), but these studies often involve special 
training, possibly indicating that monkeys’ pointing behavior might be less 
flexible than that of great apes. 

Experimental settings have also been widely used to study primates’ 
understanding of human gestures, specifically pointing gestures, for ex- 
ample, by applying the object-choice paradigm (for a review, see Miklósi 
and Soproni, 2006). After hiding a food item under one of two (or more) 
containers, with the baiting process invisible for the primate, the experi- 
menter points to the location where the food is hidden. Even if this gesture 
is combined with additional cues, such as eye gaze or gaze alternation and 
orienting the body towards the baited container, primates usually fail to use 
this information to find the food (Anderson et al., 1996; Call et al., 2000; 
Povinelli et al., 1999). Whether these negative findings can be explained 
by methodological issues (Barth et al., 2005; Mulcahy and Hedge, 2012), 
or whether this shows that primates do not understand others’ pointing 
gestures as cooperative communicative acts helping them find the food 
(Hare and Tomasello, 2004), is currently unclear. 

While this section focused on changes of research interests over time, the 
following part of this chapter specifically targets those aspects of primate 
gestural communication that are currently investigated to find evidence 
for a gestural origin of human language. 


6. What is known about primate gestural communication? 


In the following section, current knowledge about primate gestural 
communication is summarized, and it is discussed whether these findings provide 
support for a gestural scenario of language evolution. 


6.1 Intentional use 


To identify intentionality in primate gestures, several criteria have been 
adapted from research into gesture use of prelinguistic children (Leavens et 
al., 2005b). These criteria include the social use of gestures in the presence 
of an audience, the adjustment of gesture use depending on the recipient’s 
behavior, as well as the persistent and elaborated gesture use in case the 
recipient does not respond. However, there is currently no agreement about 
which and how many of these markers of intentionality are necessary to 
define a gesture (Liebal et al., 2013b). In the following, each of these dif- 
ferent markers of intentionality will be briefly discussed with regard to the 
existing findings. 

Considering the social use of gestures in the presence of an audience, 
many experimental studies have demonstrated that in interaction with hu- 
mans, great apes as well as several monkey species only gesture if an audi- 
ence is present (Anderson et al., 2010; Blaschke and Ettlinger, 1987; Call 
and Tomasello, 1994; Hattori et al., 2010; Hostetter et al., 2001; Poss et 
al., 2006). Many of these and other studies showed that great apes and 
monkeys adjust their gesture use to the attentional state of the human, and 
only use visual gestures if the human is oriented towards them (Bourjade et 
al., 2014; Hattori et al., 2010; Liebal et al., 2004b; Maille et al., 2012; but 
see Povinelli and Eddy, 1996). If the human is present, but not attending to 
them, some studies reported that apes are more likely to use auditory signals 
to attract the human’s attention (Cartmill and Byrne, 2007; Hostetter et al., 
2001; Leavens et al., 2004; Poss et al., 2006), while other studies failed to 
find similar results (Liebal et al., 2004b; McCarthy et al., 2013; Theall and 
Povinelli, 1999). However, when great apes were given the opportunity to 
change their position in relation to that of the human experimenter, they 
preferred to move into the human’s visual field instead of attracting the 
human’s attention by using auditory signals (Liebal et al., 2004b). 

Some researchers investigated whether chimpanzees use their gaze, instead of 
gestural signals, to direct the human’s attention to the food they want to 
obtain, by repeatedly looking back and forth between the human and the 
food item (Leavens and Hopkins, 1998; Tomasello et al., 1994). For exam- 
ple, Leavens and Hopkins (1998) found that at least 80% of the chimpan- 
zees’ indicative gestures (food beg and pointing gestures) were accompanied 
by such gaze alternation. Interestingly, this proportion even increased to 
100% if chimpanzees vocalized simultaneously when producing a gesture. 
According to Leavens and Hopkins (1998), it remains an open question 
whether the occurrence of gaze alternation depended on the number of different 
communicative means involved (gestures and/or vocalizations), or whether 
an increased proportion of gaze alternation and types of communication 
rather reflected their high motivation to get the food — and therefore the 
intensification of their communicative attempts. It is important to note that 
although some researchers use gaze alternation as a marker of intentional 
communication (Leavens et al., 2005b; Schel et al., 2013b), an alternative 
explanation might be that primates simply gaze back and forth to check 
if the food is still there, rather than intentionally directing the human’s 
attention towards the food (Liebal et al., 2013b). 

In case the recipient does not respond to another’s gesture, the gesturer 
can either repeat the very same gesture (persistence) or use a different, 
potentially more efficient gesture (elaboration) to elicit a response. Both 
persistence and elaboration are considered markers of intentional use, be- 
cause they indicate that primates can use their gestures flexibly to achieve 
their goals (Liebal et al., 2013b). Regarding persistence, several studies have 
shown that both captive and wild chimpanzees continue to gesture in case 
there is no response to their initial gesture (Hobaiter and Byrne, 2011a; 
Liebal et al., 2004a; McCarthy et al., 2013; Roberts, Vick, and Buchanan- 
Smith, 2013), while this was not found for gorillas and orangutans (Genty 
and Byrne, 2010; Tempelmann and Liebal, 2012). Therefore, evidence for 
persistence in case of failed communicative attempts is mixed, and varies 
across great ape species. Similarly, research into elaborated gesture use 
produced inconsistent findings. While some studies failed to find evidence 
for the use of more efficient gestures in case of communicative failure (Gen- 
ty and Byrne, 2010; Liebal, Call, et al., 2004a; Tempelmann and Liebal, 
2012), an increasing number of studies with both wild and captive great 
apes shows that they are able to modify their gestures in case their goal is 
not or only partially met (Cartmill and Byrne, 2007; Leavens et al., 2005b; 
Roberts et al., 2013). 

Taken together, gestures are, by definition, intentionally used signals, 
although the type and number of criteria used to assess intentionality varies 
drastically across studies. Furthermore, while there is convincing evidence 
that both monkeys and great apes consider the presence of an audience and 
adjust their gestures to the attentional state of the recipient, findings for the 
use of attention-getting gestures to capture others’ attention as well as for 
the persistent and elaborated gesture use if the recipient does not respond 
are rather inconsistent. Still, gesture researchers conclude that unlike vo- 
calizations and facial expressions, gestures are intentionally used signals, 
and interpret this as evidence for a gestural origin of human language (To- 
masello, 2008). This traditional dichotomy between intentional gestures 
in contrast to involuntarily produced vocalizations and facial expressions, 
however, is increasingly challenged by several studies (Crockford et al., 
2012; Scheider et al., 2016; Schel et al., 2013b; Waller et al., 2015). The current 
lack of evidence for the intentional use of vocalizations and facial expressions 
as compared to the gestural modality might indicate “blind spots” 
in vocal and facial research, because of the different research interests across 
modalities (Slocombe et al., 2011) (see section 2). Therefore, although ges- 
ture researchers generally highlight that primates’ ability to control the 
production of their gestures, but not vocal and facial signals, is important 
evidence for a gestural origin of human language (Tomasello, 2008), this 
conclusion might be premature because of the current knowledge gaps 
regarding other communicative modalities (Slocombe et al., 2011). 


6.2 Flexibility of use 


Flexibility in gestural communication can be considered in different ways. 
While this section will specifically address the usage of gestures across differ- 
ent contexts, the combination of gestures into longer, potentially meaningful 
sequences is discussed in a separate section. 

The term “means-end dissociation” (Bruner, 1981) has been adapted by 
primate gesture research to describe the use of one gesture in several con- 
texts, or vice versa, the use of several gestures within one specific context 
(Tomasello et al., 1994). Therefore, this measure describes the extent of flex- 
ibility present in the use of a given repertoire for a variety of goals primates 
want to achieve. For example, great apes and siamangs in captive settings 
have been demonstrated to use between 50% and 75% of their gestures in 
two or more contexts (Liebal et al., 2004c; Liebal et al., 2006; Pika et al., 
2003, 2005; Tomasello et al., 1997). For wild gorillas, it has been shown 
that they use 10 of their most common gestures in several contexts (Genty 
et al., 2009). Across species, the highest variety of gestures used within one 
specific context was observed in the play context. Orangutans also perform 
many of their gestures in the context of affiliation and in interactions about 
food (Liebal et al., 2006), while bonobos and chimpanzees use a high variety 
of gestures in the agonistic context (Pika et al., 2005; Tomasello et al., 
1997). Tactile gestures are used more frequently for several purposes, while 
visual gestures are more likely to occur in a specific context, such as groom- 
ing, sexual behavior, or requesting food (Liebal et al., 2004; Liebal et al., 
2006; Pika et al., 2003). Thus, visual gestures seem to function more often 
as “intention movements” (Tinbergen, 1952), which are used for a specific 
purpose, and which represent abbreviations of full-fledged behaviors (Tomasello 
et al., 1989). Attention-getting gestures, on the other hand, which 
are not necessarily used to capture others’ attention, but to trigger them 
into action, are mostly tactile and auditory gestures (Liebal and Call, 2012). 

A study with captive bonobos and chimpanzees found that the majority 
of their gestures is not associated with specific contexts, while facial and 
vocal signals are more likely to be context-specific (Pollick and de Waal, 
2007). The authors concluded that their findings support the hypothesis of 
a gestural origin of language, since in contrast to facial and vocal signals, 
gestures are less context-specific and thus less tied to the (involuntary) 
expression of emotions, which “...makes gesture a serious candidate mo- 
dality to have acquired symbolic meaning in early hominins” (Pollick and 
de Waal, 2007, p. 8185). 


6.3 Gesture combinations 


In searching for the roots of human language, gesture researchers have been 
interested in gesture combinations (or sequences) for two different reasons. 
First, “duality of patterning” is considered one of the design features of 
language (Hockett, 1960) and refers to the combination of a finite number 
of meaningless sounds (phonemes) into meaningful units (morphemes and 
words), which can be further combined into an (unlimited) number of 
utterances (phrases, sentences). “Productivity” is another design feature 
of human language and describes the ability to create new meanings by 
combining already existing utterances (Hockett, 1960). Therefore, much 
attention has been paid to the question whether primates combine their signals into 
consecutive, meaningful sequences, which might have a different meaning 
than their single components, and whether such combinations are based 
on specific, potentially grammatical rules. The second reason for studying 
gesture combinations is that “...the extent to which animals have the flex- 
ibility to move beyond single signal production to combine signals into 
sequences is highly indicative of the potential for complexity in the system.” 
(Liebal et al., 2013b, p. 317). Thus, although primates are able to 
create at least some new gestures and incorporate them into their individual 
repertoires, gesture combinations offer the possibility to further increase 
the flexibility of a rather limited gestural repertoire, by combining single 
gestures into larger sequences. 

Gesture sequences are commonly defined as multiple gestures produced 
one after the other by one individual, towards the same recipient and the 
same goal (Liebal et al., 2004a; Tomasello et al., 1994). Studies vary, how- 
ever, regarding the time interval in which one gesture must follow another 
one to be considered part of a consecutive sequence. For example, some 
studies used a 1-second time interval between two gestures (Genty and Byrne, 
2010; Hobaiter and Byrne, 2011b), while other studies used a 5-second 
interval (Liebal et al., 2004a; Tempelmann and Liebal, 2012), or even up 
to 30 seconds (Roberts et al., 2013). This is important, since the observed 
occurrence and patterns of gesture sequences may vary drastically because 
of these different definitions. 
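
To illustrate why the choice of interval matters, the following minimal sketch groups the same set of gesture onsets into sequences under the three intervals mentioned above. The onset times are hypothetical, the function name `split_into_sequences` is invented for illustration, and the sketch assumes that the interval is measured between the onsets of consecutive gestures; individual studies may measure it differently.

```python
from typing import List


def split_into_sequences(onsets: List[float], max_gap: float) -> List[List[float]]:
    """Group gesture onset times (in seconds) into sequences.

    A gesture joins the current sequence when it starts no more than
    `max_gap` seconds after the previous gesture; otherwise it opens
    a new sequence. Assumption: gaps are measured onset to onset.
    """
    sequences: List[List[float]] = []
    for t in sorted(onsets):
        if sequences and t - sequences[-1][-1] <= max_gap:
            sequences[-1].append(t)
        else:
            sequences.append([t])
    return sequences


# Hypothetical onsets of gestures by one signaller towards one recipient and goal.
onsets = [0.0, 0.8, 4.0, 7.5, 30.0, 31.0, 58.0]

for max_gap in (1.0, 5.0, 30.0):
    units = split_into_sequences(onsets, max_gap)
    combined = [u for u in units if len(u) > 1]
    print(f"interval {max_gap} s: {len(units)} units, "
          f"{len(combined)} of them with more than one gesture")
```

With these invented data, the number and length of the resulting sequences change with every interval, although the underlying behavior is identical.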

The few existing studies all focused on great apes, in both captive and 
natural settings. For captive chimpanzees, it has been found that about one 
third of their gestures occur as part of a sequence, with two thirds of se- 
quences consisting of two consecutive gestures (Liebal et al., 2004a). Almost 
40% of these sequences are repetitions of the very same gesture. Consider- 
ing the potential meaning of these combinations, there is no evidence that 
sequences are used for different functions than their single components, 
suggesting that gestures are not combined in meaningful ways to create new 
meanings. Rather than representing planned sequences used to increase the 
efficacy of certain gestures (e.g., by first using an attention-getting gesture 
to direct the recipient’s attention towards the following gesture), it seems 
likely that sequences merely emerge because the recipient does not — or not 
appropriately — respond to the other’s communicative attempts (Liebal et 
al., 2004a; McCarthy et al., 2013). 

Similarly, for wild chimpanzees, Roberts and colleagues (2013) re- 
ported that they repeat gestures and even elaborate gesture use in case 
the recipient does not respond in the expected way. Hobaiter and Byrne 
(2011b) used a slightly different approach and found that sequences of 
wild chimpanzees represent a way to increase communicative efficiency. 
Specifically, gesture use changes over development, with chimpanzees shifting 
from using long, often redundant “rapid-fire gesture sequences” to more 
selective and efficient gesture use. While “rapid-fire gesture sequences”, 
which consist of many different gestures and are produced without any 
response-waiting, are largely used by younger individuals, “bouts” often 
occur if the recipient does not react. In this case, chimpanzees are more 
likely to repeat the same gesture and wait for the recipient’s response 
(Hobaiter and Byrne, 2011b). 

For gorillas, a different picture emerged (Genty and Byrne, 2010). Like 
captive chimpanzees, they combine about one third of their gestures into 
sequences, mostly consisting of two gestures. There is also no evidence 
that sequences convey different meanings than the single gestures they are 
composed of. However, in contrast to chimpanzees, sequences in gorillas do 
not emerge because of an unresponsive recipient. Instead, Genty and Byrne 
(2010) suggest that gesture sequences in gorillas, which occur largely in the 
play context, provide the means to adjust their ongoing interactions. Tanner 
(2004), in her qualitative analysis of gesture sequences exchanged between 
a zoo-living pair of gorillas, came to a different conclusion, because she 
did not observe that gestures were repeated if there was no response to the 
initial gesture. Instead, she found that gorillas incorporate iconic gestures 
into their sequences, which depict the requested motion or the destination 
of the requested action, like the “armswing under”, which ended touching 
the area between the legs, which according to Tanner (2004) indicates the 
location for sexual play. Gorillas seem to use such sequences and specifically 
the incorporated iconic gestures to communicate “...decisions on when, 
where and how to play...”, which “...evolve gradually in the course of 
interaction.” (Tanner, 2004, p. 18). 

Finally, captive Sumatran orang-utans apparently do not consider 
the recipient’s behavior, since they continue to gesture regardless of the 
recipient’s response (Tempelmann and Liebal, 2012). It has therefore been 
suggested that sequences in orang-utans do not emerge because of the lack 
of a response, but they might be rather the result of high arousal, because 
they largely occur in the play context (Tempelmann and Liebal, 2012). 

Taken together, although several great ape species have been reported 
to combine gestures into longer sequences, existing research provides little 
evidence that they represent meaningful combinations. Thus, at least in the 
gestural modality, there is no evidence for productivity and most likely also 
not for duality of patterning. However, note that to date very little is 
known about meaning (and meaningful components) in primates’ gestural 
communication (see following section). Furthermore, although at least some 
studies show that sequences do not simply emerge because of the recipient’s 
lack of response, there is currently no evidence that gesture combinations 
are governed by specific rules. However, at least some studies demonstrate 
that gesture sequences are used to adjust communicative means to the re- 
cipient’s behavior, by either repeating a gesture or by producing a different 
gesture to elicit an appropriate response, providing further evidence for the 
flexible usage of great apes’ gestural repertoires. 


6.4 Meaning and reference 


Meaning is broadly defined as the information an individual intends to 
convey to another individual. If a signal can be linked to specific mean- 
ings, this is referred to as semanticity, which is yet another design feature 
of language (Hockett, 1960). This relationship between a word and its 
meaning is thought to be of arbitrary nature (Hockett, 1960), although 
some would argue that there is more iconicity present in language than is 
currently assumed (Kendon, 2016). The question whether primates also use iconic 
gestures has therefore attracted much research (Douglas and Moscovice, 
2015; Genty and Zuberbühler, 2014; Perlman and Cain, 2014; Tanner and 
Byrne, 1996), and is discussed later in this chapter. 

The use of such linguistic terms to describe animal communication sys- 
tems has been repeatedly criticized (Font and Carazo, 2010; Rendall et al., 
2009; Scott-Phillips, 2008). Some argue that the use of these linguistic terms 
might be misleading, since there might be no language-like meaning in pri- 
mate communication systems (Rendall et al., 2009). Others highlight that it 
is important to differentiate between message (which is the information the 
sender encodes in the signal and intends to convey) and meaning (which the 
recipient decodes from the signal and the contextual information) (Font and 
Carazo, 2010). In gesture research, however, this differentiation between 
the sender’s message and the derived meaning is usually not made. 

Currently, there seem to be two “camps” regarding the question if ges- 
tures do have specific meanings or not. As repeatedly described in this 
chapter, many researchers highlight the flexible nature of gesture use, with 
one gesture used in different contexts and several different gestures used 
to achieve the same goal (for an overview, see Call and Tomasello, 2007; 
Liebal et al., 2013b). The conclusion from this approach is that gestures 
might not have tight, specific meanings and that the corresponding information 
a gesturer intends to convey might be determined by the context 
in which an interaction takes place. Therefore, the meaning of a specific 
gesture might vary depending on the context. For example, the “extend 
arm” gesture used by orang-utans can function as a submissive signal when 
used by a subordinate towards a higher-ranking individual. Alternatively, 
when used in interactions between mothers and their infants, it may serve 
as an invitation to follow (Liebal et al., 2010; Liebal et al., 2006). For 
chimpanzees, de Waal (2003, p. 22) reports that “the begging gesture (...) 
has absolutely no meaning unless one can deduce its referent from the 
context”, since they direct this gesture to food possessors, while in 
conflicts they direct it at bystanders to request support. 

Other researchers conclude that it is possible to assign specific meanings 
to specific gesture types (Cartmill and Byrne, 2010; Hobaiter and Byrne, 
2014). In their approach, intended meaning is treated as equivalent to linguistic 
meaning in humans. For example, Hobaiter and Byrne (2014) derived the 
intended meaning of a gesture from the “apparently satisfactory outcome”. 
This is used as an approximation of the gesture’s meaning, in a way that 
the gesturer’s goal is inferred from the satisfying outcome of a gesture, 
which resulted in the cessation of the interaction. Based on this approach, 
they found that wild chimpanzees use 10 of the previously identified 66 
gesture types (Hobaiter and Byrne, 2011a) for one specific outcome, while 
other gesture types were used for two or three of the total of 19 possible 
outcomes (Hobaiter and Byrne, 2014). Furthermore, 13 gestures were used 
for one specific outcome in at least 70% of cases and were therefore classified as 
gestures with “tight meaning”, while 11 gesture types were used for one specific 
outcome in only 50% to 70% of cases and were therefore classified as gestures with 
“loose meanings”. Twelve gestures were classified as ambiguous signals. 
Thus, gesture types varied with regard to the number of potential outcomes 
as well as the frequency of use for one of those potential outcomes. Based on 
these findings, Hobaiter and Byrne (2014) conclude that most chimpanzee 
gestures can be assigned to one or more intended meanings. A very similar 
approach was used by Cartmill and Byrne (2010) to study the meaning of 
gestures in captive Sumatran orang-utans. The orang-utans used 40 of their total of 64 
gestures to achieve one of six different outcomes, and 29 of these gestures 
seemed to have only one specific meaning. Because of this frequent matching 
between the goal of an interaction and the corresponding outcome, the 
authors concluded that despite the considerable degree of flexibility, there is 
evidence for intended meaning in at least some of the orang-utans’ gesture 
types (Cartmill and Byrne, 2010). 
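
The threshold-based classification into “tight” and “loose” meanings described above can be sketched as follows. This is only an illustration under stated assumptions: the gesture labels, outcomes, and counts are hypothetical, the function name `classify_meaning` is invented, and treating everything below 50% as ambiguous is an assumption about how the remaining gestures were handled, not a description of the authors’ actual procedure.

```python
from collections import Counter
from typing import Dict, List


def classify_meaning(outcomes: List[str], tight: float = 0.70, loose: float = 0.50) -> str:
    """Label a gesture type by the share of its most frequent outcome.

    Assumed rule: share >= `tight` gives "tight", share >= `loose`
    gives "loose", and anything lower is treated as "ambiguous".
    """
    if not outcomes:
        raise ValueError("need at least one observed outcome")
    top_count = Counter(outcomes).most_common(1)[0][1]
    share = top_count / len(outcomes)
    if share >= tight:
        return "tight"
    if share >= loose:
        return "loose"
    return "ambiguous"


# Hypothetical observations: gesture type -> apparently satisfactory outcomes.
observations: Dict[str, List[str]] = {
    "gesture_A": ["play"] * 8 + ["groom"] * 2,                    # top outcome 80%
    "gesture_B": ["follow"] * 6 + ["play"] * 4,                   # top outcome 60%
    "gesture_C": ["play"] * 4 + ["move away"] * 3 + ["groom"] * 3,  # top outcome 40%
}

labels = {gesture: classify_meaning(obs) for gesture, obs in observations.items()}
print(labels)
```

The sketch also makes visible that the resulting labels depend on where the thresholds are set, which is one reason why the same observations can be read as evidence for either specific meanings or flexible use.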

Roberts and colleagues (2013), however, seem to combine the approach- 
es of these two “camps”, since they conclude that gestures of wild chimpan- 
zees have specific meanings, independent of the context, but recipients are 
able to flexibly interpret others’ gestures by additionally using contextual 
information (Roberts et al., 2012a; Roberts et al., 2013). 

Based on the current evidence, the conclusion whether primate gestures 
have specific meanings or whether they are flexibly used for different func- 
tions seems to depend on varying perspectives on the very same data set. 
Thus, great apes use most of their gestures in more than one context, while 
some of them represent context-specific gestures. However, while some au- 
thors focus on those gestures used for several functions and consequently 
highlight their flexible use (Call and Tomasello, 2007), others focus on 
the gestures used for one or few outcomes, and consequently emphasize 
the specific meaning(s) of these gesture types (Cartmill and Byrne, 2010; 
Hobaiter and Byrne, 2014). 

The question whether gestures have specific meanings is tightly linked to the 
question whether a gesture can refer to a specific referent. Reference is a fundamental 
component of language; however, this term is used very differently 
across disciplines (Leavens et al., 2005b). While linguists usually use “refer- 
ence” synonymously with symbolic reference, developmental psychologists 
focus on nonverbal reference in the form of pointing gestures (Bates et al., 
1987; Camaioni, 2001). Furthermore, linguists usually emphasize that the 
relationship between a word and its referent is arbitrary (but see Kendon, 
2016), while developmental psychologists will argue that for pointing ges- 
tures this relationship is not arbitrary, since it is determined by the triadic 
(spatial) relationship between signaler, recipient, and the external entity 
(Leavens et al., 2005b). 

This different operationalization of reference in spoken language and 
gestural communication in humans is also evident in comparative research 
into primate communication, since reference is defined differently across 
modalities. Vocal researchers often focus on context-specific vocalizations 
“...to find the animal equivalent to referential words in human language” 
(Liebal et al., 2013b, p. 399) in the form of functionally referential vocalizations. 
Such vocalizations need to be reliably produced in response to a specific stimulus 
(Evans, 1997) and perceived independently of the context, which means 
that listeners should respond (appropriately) to this vocalization even in the 
absence of the eliciting stimulus, in the same way they would respond to 
the actual eliciting stimulus (Macedonia and Evans, 1993). Although some 
scholars challenge this traditional strong focus on functionally referential 
vocalizations (Wheeler and Fischer, 2012), it remains an intensively re- 
searched topic (Kalan et al., 2015; Murphy et al., 2013; Schel et al., 2013a). 

In the gestural domain, pointing gestures are considered referential sig- 
nals. However, in contrast to functionally referential vocalizations, pointing 
does not necessarily refer to specific entities and thus has no one-to-one 
referential meaning. Instead, the meaning of a pointing (or other referential) 
gesture can only be interpreted if the gesturer and the recipient share a com- 
mon ground (Liebal et al., 2013a; Tomasello et al., 2007). The nature of 
nonhuman primates’ pointing gestures, however, is fiercely debated (Leav- 
ens et al., 2005a; Tomasello, 2006). With very few exceptions (Hobaiter et 
al., 2013; Vea and Sabater-Pi, 1998), the production of pointing gestures is 
limited to interactions with humans, where primates point to request food 
rewards which they cannot obtain otherwise. Although it has been shown 
that great apes may also point to the location of a hidden tool so that the 
human is able to find the food (Call and Tomasello, 1994; Zimmermann 
et al., 2009), their underlying motivation is not necessarily to inform the 
human. Thus, with the exception of language-trained apes (Lyn et al., 2011; 
Menzel, 1999), primates do not use their pointing gestures declaratively, 
but imperatively, since they want the human to get the food for them. Fi- 
nally, there is little evidence that primates understand pointing gestures of 
humans (Miklósi and Soproni, 2006) or even conspecifics (Tempelmann et 
al., 2013) to find hidden food in both cooperative and competitive settings. 
Together these findings suggest that even though primates produce point- 
ing gestures, they are mostly limited to interactions with humans, they are 
used in very specific contexts, and seem to serve a different function than 
pointing gestures in humans. 

Iconic gestures are considered a special type of referential signal, and 
have received special attention in theories of language evolution (Arbib, 
2005; Armstrong and Wilcox, 2007; Perlman and Cain, 2014). Unlike vocalizations, 
gestures “...can provide more obvious iconic links with objects 
and actions in the physical world” (Gentilucci and Corballis, 2006, p. 950), 
because hands (and arms) can be used to perform actions or to represent 
objects, which closely resemble real actions (Cartmill et al., 2012). 

The exact nature and definition of an iconic gesture, however, varies 
across studies or might overlap with terms like representative, pantomimic, 
and mimetic (Douglas and Moscovice, 2015; Perlman et al., 2014). In 
general, iconic gestures either depict a motion or the shape of a referent 
(Tanner and Byrne, 1996). One of the first studies reporting the use of iconic 
gestures by a single gorilla male was published by Tanner and Byrne (1996), 
who found that this male performed gestures indicating the path of motion 
either in space or on the female’s body to request further actions, such as 
moving into a specific direction. Later studies on other groups of gorillas did 
not find additional evidence for iconic gestures in this species (Genty et al., 
2009; Pika et al., 2003). More recently, however, several studies have been 
published suggesting the use of iconic gestures not only in gorillas (Perlman, 
Tanner, and King, 2012), but also in bonobos (Douglas and Moscovice, 
2015; Genty and Zuberbühler, 2014). For example, captive bonobos use a 
series of gestures incorporating iconic gestures, such as beckoning gestures, 
to prompt another individual to approach and then to jointly move to the 
indicated location to mate there (Genty and Zuberbühler, 2014). Some 
studies even claim that orang-utans and bonobos use pantomime (as a 
type of iconic gesturing) in a way that the corresponding action (referent) 
is re-enacted (Douglas and Moscovice, 2015; Russon and Andrews, 2011). 

Taken together, different criteria are used to define reference across mo- 
dalities. In gesture research, most studies have focused on primates’ produc- 
tion and comprehension of pointing gestures, mostly in interactions with 
humans. There is currently little evidence that primates produce pointing 
gestures in interactions with conspecifics, and that they understand others’ 
pointing gestures as referential means of communication. Furthermore, our 
current knowledge about iconic gesture use in nonhuman primates is still 
very limited, so that any conclusions regarding a gestural origin of human 
language seem premature. 


6.5 Acquisition of gestures 


How primates acquire their gestures during ontogeny is an interesting topic 
to study, because this provides insight into the flexibility and “openness” of 
this communicative system. Different mechanisms have been suggested, each 
leading to different predictions regarding how gestures might be acquired 
(Genty et al., 2009; Schneider, Call, and Liebal, 2012b). For example, if 
gestures are genetically determined, we would expect that all individuals of 
a given species use rather similar repertoires, with little variability across 
individuals. This would also mean that individuals do not learn or create 
new gestures to incorporate them into their repertoire, resulting in relatively 
“closed”, species-specific repertoires (Genty et al., 2009; Hobaiter and By- 
rne, 2011a). It is important to point out, however, that the genetic deter- 
mination of gestures would not necessarily mean that structural properties 
of gestures or their use could not be modified over an individual’s lifetime. 
Furthermore, genetic transmission also does not imply that all gestures have 
to be present right after birth (Rosati et al., 2014). On the other hand, if 
there is much variability of gestural repertoires across individuals, and if the 
number of gestures increases over lifetime, with new, mostly idiosyncratic 
elements being added to the existing repertoire, then gestures are most likely 
acquired by some other mechanism than genetic transmission (although 
a genetic component can still not be excluded) (Liebal et al., submitted). 

Two scenarios have been proposed. First, gestures could be acquired by 
some form of social learning, with one individual acquiring parts of the 
behavioral repertoire of another individual (Whiten and Ham, 1992). In this 
context, imitation is discussed as one potential mechanism, which would 
result in very similar individual repertoires within a group, since individuals 
learn their gestures from each other, while across groups, repertoires are 
expected to vary (Tomasello et al., 1997). Second, if ontogenetic ritualiza- 
tion is the major mechanism underlying gesture acquisition, which involves 
the shaping of previously non-communicative behaviors into increasingly 
ritualized, communicative gestures (Call and Tomasello, 2007; Tomasello, 
2008), we would expect a high degree of variability of individual repertoires 
within and across groups. 

Only very few studies have investigated how primates acquire their gestures, 
almost exclusively in great apes, in both natural (Fröhlich et al., 2016a, 2016b) 
and captive settings (Bard et al., 2014; Schneider et al., 2012a), resulting in 
rather mixed findings (Liebal et al., submitted). While some studies conclude 
that gestures of great apes are most likely genetically determined, because 
even across species, individuals shared very similar repertoires (Genty et al., 
2009; Hobaiter and Byrne, 2011a; Schneider et al., 2012b), others conclude 
that gestures emerge in social interactions (Bard et al., 2014; Fröhlich et al., 
2016b; Halina et al., 2013). Across existing studies, however, imitation does 
not seem to play any significant role in gesture acquisition. 

There are a couple of reasons for these inconsistent findings. The logis- 
tics underlying data collection are often challenging, since it is difficult to 
observe several individuals, who are often not in the same group or zoo, 
over longer periods of time. As a result, sample sizes are very small. Ob- 
servational periods are often too short to capture the individuals’ complete 
repertoires, resulting in an overestimation of variability between individual 
repertoires. Furthermore, researchers use different criteria for defining a 
gesture as well as intentionality (see sections 2 and 3), and as a result, the 
number of identified gestures might vary drastically between studies, which 
complicates comparisons between studies of different research groups. 
Therefore, it is difficult to use the current evidence to either support or 
reject a gestural origin of human language. 


7. What do primate gestures tell us about language 
evolution? 


This chapter provided an overview of those aspects of primate gestural com- 
munication, which are frequently studied with the aim to find evidence for 
a gestural origin of human language. The current knowledge demonstrates 
that primates use their gestures intentionally and flexibly across a variety of 
different contexts, and that they adjust them to the recipient’s behavior (Call 
and Tomasello, 2007). It is important to note, however, that the majority of 
gesture research focuses on great apes, often in captive settings (Slocombe 
et al., 2011). Furthermore, studies vary regarding the kind and number of 
criteria applied as markers for intentional use and they also operationalize 
flexibility in different ways. Still, it seems that primates’ gestural communi- 
cation is characterized by a high degree of flexibility in gesture production 
and usage, which seems to support a gestural origin of human language. 
The evidence for rule-governed, meaningful combinations and referentiality, 
either in the form of pointing gestures or iconic gestures resembling 
actions or objects, is rather mixed. This also concerns mechanisms of 
gesture acquisition, which have been found to vary drastically across stud- 
ies. Differences between species and/or studies are most likely explained by 
varying definitions and methodological approaches, with one of the most 
serious shortcomings representing the lack of shared definitions — such as 
for the term “gesture” — or the reliance on linguistic terms. Based on the 
current evidence, it seems rather difficult to draw any substantial conclu- 
sions regarding the potential origin(s) of human language. 

What can we conclude from this current situation? Is it correct that “... 
studies of nonhuman animals provide virtually no relevant parallels to hu- 
man linguistic communication ... thus leaving any insights into language’s 
origins unverifiable” (Hauser et al., 2014, p. 1)? This strong conclusion 
might not be justified — even if the current evidence is not sufficient to vali- 
date a gestural origin of human language, comparative research across dif- 
ferent modalities convincingly showed that it is possible to identify building 
blocks to human language in the communication of other primates, reject- 
ing the notion that language evolved from scratch in humans only. Further- 
more, there are currently promising developments in the field of primate 
gesture research, which will — at the very least — broaden our knowledge 
about this modality, and in the best case, shed light on the origins of human 
language. The final section of this chapter will therefore summarize some 
recent developments as well as “blind spots” in primate gesture research. 

First, there is an increase in studies investigating gestural communication 
of primates in their natural habitats (Douglas and Moscovice, 2015; 
Fröhlich et al., 2016a; Hobaiter and Byrne, 2011a; Pika and Mitani, 2006; 
Roberts et al., 2013). This addresses the challenge that most of our current 
knowledge about primate gestural communication is based on captive studies, 
therefore potentially lacking ecological validity. However, studies are 
needed that systematically compare gesture use in natural and captive 
settings, to estimate the impact of the environment on primates’ 
repertoires and gesture usage. 

Second, since flexibility and the use for different functions are important 
aspects in gesture research, context-specific gestures, which are used to 
achieve one specific goal, have been largely neglected or might have been 
even excluded from further analyses. Consequently, it is possible that 
potentially referential gestures, which are linked to specific events, 
individuals, or objects but are not used for more than one social function, 
have not been discovered yet. 

Third, almost all existing gesture studies consider proximate aspects, 
such as the underlying cognitive mechanisms or developmental pathways of 
gestural communication. There are, however, virtually no studies which in- 
vestigate ultimate aspects of gesture use, such as whether the use of specific 
gestures in “evolutionary urgent contexts” like consortships in the sexual 
context (Hobaiter and Byrne, 2012), has specific benefits for the gesturer, 
for example, in the form of increased numbers of mating partners. If we 
have a better understanding of the gestures’ effect on the recipient, then 
we are better able to investigate the evolutionary function of gestures in 
terms of their costs and benefits. This, in turn, might enable us to develop 
more sophisticated gestural scenarios of language evolution, specifically 
with regard to the adaptive functions of gestural communication in the last 
common shared ancestor of humans and great apes. 

Fourth, in comparative gesture research, gesture types are usually de- 
fined based on their function (e.g., ‘food beg’) rather than their structur- 
al properties (Roberts et al., 2012b). Therefore, it is currently unknown 
whether changes in these structural properties (e.g., arm extended with 
hand stretched versus arm extended with hand bent) result in a differential 
use — and thus potentially different meanings of these gestures. 

Fifth, it is important to identify and differentiate the many different 
“components” of language and the abilities related to its use, to carefully 
select which aspects of primate gestural communication should be investi- 
gated and to decide which comparisons across species are meaningful at all 
(Fitch, 2005, 2010; Hauser et al., 2002). For example, although the criteria 
for the definition of intentionally used gestures are adapted from research 
into gestures of pre-linguistic children (Leavens et al., 2005b), there are 
very few studies which systematically compare gesture use in nonhuman 
primates and young children (Bohn et al., 2015; Liszkowski et al., 2012). 
In her review, Pika (2008) concludes that the majority of primate gestures 
are dyadic gestures, used to imperatively request immediate actions from 
the recipient, while children start to use triadic gestures from an early age 
on, which they also use declaratively to communicate about external entities 
outside the dyad, such as objects or events. Furthermore, many gesture types 
of primates closely resemble actions (Liebal and Call, 2012), while children 
from an early age on also use more abstract, ritualized gestures (e.g., ‘wave 
good bye’, head shakes for “no”). Therefore, it seems that the gestural com- 
munication of nonhuman primates is fundamentally different from gesture 
use in pre-linguistic children. However, because of the lack of systematic 
comparisons of the same gestural properties in nonhuman primates and 
humans, it seems difficult to conclude which aspects of their communication 
are informative with regard to hypotheses of language origins. 

Finally, the review by Slocombe et al. (2011) showed that only few 
studies on primate communication investigate more than one modality 
at a time. However, given that language is multimodal (Kendon, 2004; 
McNeill, 1992), it seems unlikely that language evolved either from vo- 
calizations or gestures or facial movements (Arbib et al., 2008). Therefore, 
recent publications suggest a multimodal language origin (Wacewicz and 
Zywiczynski, in press), with special attention paid to the role of orofacial 
movements. Thus, unlike facial expressions, orofacial movements seem to 
be voluntarily produced, because nonhuman primates have direct connec- 
tions between the premotor cortex and the different nuclei controlling the 
jaws, lips and tongue (Jürgens, 2002). The importance of a multimodal 
approach to primate communication (Liebal et al., 2013b; Slocombe et al., 
2011) is now increasingly acknowledged, since there is an increasing num- 
ber of studies which investigate different modalities in a more integrated 
way (Ghazanfar and Logothetis, 2003; Habbershon et al., 2013; Leavens 
and Hopkins, 2005; Micheletta et al., 2013; Partan, 1999; Taglialatela 
et al., 2015; Taglialatela et al., 2011; Wilke et al., 2017). Although some 
questions and certainly methods might be modality-specific, a multimodal 
approach is essential to capture the complexity of primate communication 
and to use this knowledge to solve the “mystery” of language evolution. 
Dendrophilia and the Evolution of Syntax 


“The human understanding is of its own nature prone to suppose the existence of 
more order and regularity in the world than it finds” 


-- Francis Bacon (Bacon, 1620) 


Abstract: This chapter reviews data showing that humans are unusual in their 
ability and propensity to attribute tree structures to data streams — a trait I call 
“dendrophilia”. Unlike phonological rules, which can be captured by finite-state 
rules, syntactic rules require more powerful algorithms at the context-free level, 
which enable the flexible and extensible generation of tree structures. I review data 
supporting a key role for Broca’s area, via its powerful connections between inferior 
frontal gyrus and the parietal and occipital lobes, as playing the computational role 
of an auxiliary memory (comparable to a “stack” in computer science) to support 
these more powerful algorithms in human “wetware”. 


Keywords: Animal communication, syntax acquisition, regular/supra-regular gram- 
mar, Broca Area 


1. Introduction 


A complete understanding of human language evolution will require a bio- 
logically grounded partitioning of the human ability to acquire, process and 
produce language into its many cognitive/biological components. These 
will include such general cognitive characteristics as perception, memory, 
categorization, generalization, rule learning, social cognition, motor plan- 
ning and vocal or manual output, in addition to more specifically linguistic 
features as syntax or semantics. This broad set of abilities needed for lan- 
guage, as a whole, can be termed the “faculty of language in a broad sense” 
or FLB (Fitch et al., 2005; Hauser et al., 2002), and it is widely agreed that 
most of its sub-components are shared with other animals (Fitch, 2010; He- 
spos and Spelke, 2004; Pinker and Jackendoff, 2005; Seyfarth and Cheney, 
2014). From an evolutionary viewpoint, this indicates that such shared 
features did not evolve specifically for language, but were either present 
among common ancestors, or evolved convergently in other clades for other 
purposes (e.g. vocal production learning in birds, or descended larynges in 
deer or lions). 

We can further partition the FLB into those capabilities that are shared 
with our nearest living relatives, the great apes, and specifically with chim- 
panzees and bonobos. We can infer that this set of cognitive abilities was 
present in the last common ancestor (LCA) that we shared with chimpan- 
zees, and thus had evolved prior to the separation of the Pan and hominin 
lineages roughly six million years ago. This set of abilities provides the 
starting point from which specifically human aspects of language must 
have evolved; we might dub this cognitive ability set (CAS) of the LCA 
the CAS_LCA. 

By subtracting this CAS_LCA from the FLB, we arrive at a short list of 
cognitive skills that represent derived traits of humans — a quite brief list 
of novel traits whose evolution requires some specific explanation in any 
complete theory of language evolution. These are the explananda of a rich 
biological and evolutionary understanding of language. We can term this 
core set the FLD - the derived subset of the broad set of all capacities used 
in language. This subset is not the same as the “faculty of language in the 
narrow sense” (FLN) as previously defined (Fitch et al., 2005; Hauser 
et al., 2002). The FLN refers to those traits that are not shared with any 
animal (including, say, birds, deer and lions) and are not used in 
non-linguistic domains of human cognition (such as social, visual, or music 
cognition). This is a highly restrictive definition that was hypothesized 
by Hauser, Chomsky and Fitch to be limited to a single general purpose 
recursive ability that can map to multiple input and output systems. Re- 
cent data indicating that recursive abilities are also available in two non- 
linguistic domains (visual pattern parsing and music) suggest that, by the 
strict definition above, even recursion is not part of the FLN (Martins et 
al., 2015; Martins et al., 2014). In contrast, FLD as defined here refers to 
those linguistically relevant traits that differentiate humans from other great 
apes, regardless of domain specificity. This is a more biologically meaning- 
ful category (because it does not lump convergently evolved traits in birds 
together with homologies among primates) and has a clear and unambigu- 
ous phylogenetic meaning: this is what theories of language evolution that 
take the LCA as their starting point need to explain. 
Previous analyses of the comparative data have concluded that there 
are at least three broad areas in which we differ from chimpanzees and 
bonobos: signaling mechanisms, complex syntax, and semantics/pragmat- 
ics (Fitch, 2005, 2010). Signaling mechanisms are those involving speech 
production and (perhaps) perception (Fitch, 2000, 2009; Lieberman, 2007; 
MacNeilage and Davis, 2000). Semantic/pragmatic abilities have to do with 
meaning and inference, and include such factors as advanced theory of 
mind, pragmatic implicatures or ostensive signaling (Scott-Phillips, 2014; 
Stegmann, 2013). But the focus of the current chapter will be complex 
syntax. 


2. Syntax — Shared and Derived Components 


“Syntax” refers to our ability to arrange words into an unlimited variety of 
novel sentences that express specific meanings due to their structure. Note 
that although “syntax” in the English language is mostly accomplished via 
changes in word order (“dog bites man” vs. “man bites dog”), in many 
other languages including Latin or German word order plays a minor role, 
and such factors as case marking are the key determinants of syntactic 
structure. What is central to human syntax in general is the building up 
of complex hierarchical structures that combine a finite set of words and 
morphemes into an unlimited potential set of sentences, each of which has 
a different interpretation (Chomsky, 1957, 1965). Crucially, syntactic rules 
apply to these structures and not just the individual words they are made up 
of. Thus, we know that in the sentence “The boy who saw the dog chased 
the girl” it is the boy who chased the girl, despite the fact that the string 
“the dog chased the girl” is contained in that sentence, and that “boy” is 
further than “dog” from “girl”. This is because we interpret “who saw 
the dog” as a phrase, the whole of which modifies “the boy”, and which 
has nothing to do with the girl. This is just a simple example to show that 
the interpretation of syntax hinges on structure, not just word order or 
concatenations of words. 
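
The role of structure in this example can be sketched in a few lines of Python. The bracketing below is a simplified, hand-built tree assumed purely for illustration; the phrase labels and the helper names `leaves` and `head_noun` are hypothetical and do not come from any particular linguistic analysis. The sketch shows that the subject of “chased” is read off the tree, not off the words adjacent to the verb.

```python
# Hand-built, simplified bracketing of the example sentence (an assumption
# for illustration; real syntactic analyses are considerably more detailed).
sentence_tree = (
    "S",
    ("NP", "the", "boy", ("RelClause", "who", "saw", ("NP", "the", "dog"))),
    ("VP", "chased", ("NP", "the", "girl")),
)

def leaves(node):
    """Return the words of a (label, child, ...) tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the phrase label
        words.extend(leaves(child))
    return words

def head_noun(np):
    """Return the first word directly inside an NP that is not 'the'."""
    return next(c for c in np[1:] if isinstance(c, str) and c != "the")

flat = " ".join(leaves(sentence_tree))
subject_np = sentence_tree[1]  # first child of S is the subject phrase

print(flat)
print("contains 'the dog chased the girl':", "the dog chased the girl" in flat)
print("subject of 'chased':", head_noun(subject_np))
```

The flat string does contain “the dog chased the girl”, but the tree places “the dog” inside the relative clause, so the subject phrase is headed by “boy”.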

Linguists typically consider syntax separately from semantics, not be- 
cause the two are not inter-related, but because it is easily possible to con- 
struct syntactically perfect sentences that are semantically deviant, or even 
uninterpretable (Chomsky’s famous sentence “colorless green ideas sleep furiously” 
is a classic example). Thus, syntax is a very general rule-application 
system that can generate a huge number of sentences, only some of which 
are semantically useful or meaningful. It is also crucial that syntax applies 
across modalities, and works in precisely the same structure-determined 
way in signed language, speech, or written language. Thus, I will consider 
syntax here to be both modality-independent and separable from meaning. 
We can therefore understand complex syntax as a type of learning, 
generalization and application of particular types of rules, and consider it 
in the broader cognitive context of rule learning, applying across sensory 
domains and equally applicable to perception and production. 

A crucial next step is to attempt to characterize human syntax and to 
differentiate it from the rule-learning abilities that we share with other 
species. This attempt can profit greatly from a clear and unambiguous 
conceptual framework for characterizing different types of rules, in terms 
of their complexity, generative capabilities and their computational re- 
quirements. Such a framework is provided by formal language theory, a 
branch of mathematics founded by Alan Turing with his introduction of the 
Turing machine (Turing, 1937). Formal language theory is used today espe- 
cially in computer science for clarifying the computational requirement and 
complexity of particular problems, algorithms or programming languages 
(Hopcroft et al., 2000; Minsky, 1967; Post, 1943). Formal language theory 
has long been applied to human language as well, and some refinements of 
the theory by Noam Chomsky and colleagues in the late 1950s provided 
the first attempts to precisely characterize the computational requirements 
for human linguistic syntax (Chomsky, 1956, 1957, 1959). These early at- 
tempts have been memorialized with the term “Chomsky hierarchy” which 
is still often used today to refer to a simple four-way classification of rule 
systems (Hopcroft et al., 2000; Sipser, 1997). Modern work in computational 
linguistics has progressed far beyond this classification (cf. Jäger and 
Rogers, 2012; Joshi et al., 1991), extending the Chomsky hierarchy beyond 
this original four-way categorization. This broader, modern framework is 
best termed simply the formal language hierarchy to distinguish it from the 
older Chomsky hierarchy (though the term “extended Chomsky hierarchy” 
is still sometimes used; Jäger and Rogers, 2012). 

Today, computational linguistics has converged upon a clear and specific 
characterization of the abilities required to account for all known phenomena 
in human language syntax, known as the mildly context-sensitive 
grammars (Joshi et al., 1991; Stabler, 2004; Vijay-Shanker et al., 1987). 
These grammars, while vastly more restricted than a generalized Turing 
machine, have all of the capabilities needed to capture both the strings and 
the hierarchical structures of human syntax. 

How do these well-defined abilities differ from the rule-learning capa- 
bility of other animals? There is a rich comparative dataset now available 
concerning animal rule- and pattern learning in a wide range of species 
from pigeons and rats to chimpanzees, and abundant data make it clear 
that many species do have well-developed abilities both to learn rules, and 
to apply them to novel stimuli (generalization). However, all of the 
rule-learning abilities demonstrated by nonhuman animals can be captured at 
a lower level of the formal language hierarchy, at the so-called “regular 
grammar” level (also called finite-state grammars). The types of regular 
rules that animals have been shown to learn include transition probabilities 
(“b follows a”), simple string concatenation (e.g. the (AB)^n grammar 
discussed below), so-called “algebraic rules” like XXY (where XX denotes 
any syllable repeating itself), and simple long distance dependencies (such 
as AB*A, which means that there must be an A at both the beginning 
and end of the string with arbitrary Bs between). Thus, there can be little 
doubt that a variety of simple regular grammatical abilities represent part 
of the FLB which is shared very widely, including with apes, and are thus 
a component of the CAS_LCA. But grammars that go beyond this level (the 
“supra-regular grammars”) have not yet been conclusively demonstrated 
in any nonhuman animal species (see below). 
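
To make the regular level concrete, the rule types listed above can each be written as an ordinary regular expression. The following is a minimal sketch, not a model of any experiment: the three-syllable inventory and the pattern names are assumptions chosen for illustration, and single letters stand in for stimulus categories.

```python
import re
from itertools import permutations

# Hypothetical syllable inventory, assumed only for this illustration.
SYLLABLES = ["ba", "di", "ku"]

PATTERNS = {
    # Transition rule "b follows a": every "a" is immediately followed by "b".
    "b_follows_a": re.compile(r"(?:b|ab)*"),
    # (AB)^n: the pair "AB" repeated one or more times.
    "ab_repeated": re.compile(r"(?:AB)+"),
    # XXY: a syllable repeated once, then a different syllable. Over a finite
    # inventory this is a finite list of alternatives, hence regular.
    "xxy": re.compile("|".join(x + x + y for x, y in permutations(SYLLABLES, 2))),
    # AB*A: an A at both ends, with any number of Bs in between.
    "a_bs_a": re.compile(r"AB*A"),
}

def accepts(name, string):
    """Return True if the whole string matches the named regular pattern."""
    return PATTERNS[name].fullmatch(string) is not None

print(accepts("ab_repeated", "ABABAB"))  # True
print(accepts("a_bs_a", "ABBBA"))        # True
print(accepts("xxy", "babadi"))          # True
print(accepts("xxy", "badiba"))          # False
```

None of these patterns needs memory beyond the current state of the matcher, which is what places them at the finite-state level.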

We thus arrive at a clear characterization of the syntactic component of 
the FLD, the derived aspect of human syntactic abilities that differentiates 
our syntactic abilities from those of other species: this derived component 
is comprised of our supra-regular capabilities. In formal language theory, 
any type of grammar has, at its core, a finite-state automaton with a limited 
(but potentially large) set of states that could represent words or transi- 
tions between words and/or particular combinations of words - this is as 
true for a Turing machine as for a system implementing a simple regular 
grammar. What differentiates the supra-regular grammars is some sort of 
additional flexible memory, which can be used to store information relevant 
to past states and their “unfinished business”. Thus, the key computational 
requirement differentiating human syntactic capabilities would be such 
an additional general purpose memory store. In formal language theory 
there are many different types of such additional memory, including stacks, 
queues and the “stack of stacks” needed to parse mildly context-sensitive 
languages. However, I would suggest that this is a place where formal lan- 
guage theory is most urgently in need of updating to a more biological and 
neurally grounded model of natural computation. Thus, rather than being 
concerned with whether this additional form of memory is a stack or a 
queue (Öttl et al., 2015; Uddén et al., 2012), I think it is more profitable to 
simply think of it as a flexible type of working memory able to keep track 
of and recall past states over the span of a sentence. 

From this more general viewpoint we can consider the animal syntactic 
capabilities demonstrated to date as being more akin to long-term memory 
for items and categories. These are learned and then retained indefinitely, 
and many studies show that animals can have very large long-term mem- 
ory stores for such information. For example, pigeons and baboons can 
memorize thousands of individual images (Fagot and Cook, 2006) and 
dogs can memorize more than a thousand specific words and match them 
to objects (Griebel and Oller, 2012; Kaminski et al., 2004; Pilley and Reid, 
2011). Despite this impressive storage capacity, what none of these species 
do is flexibly and temporarily build up novel and complex hierarchical 
combinations of these stored items — requiring a type of structural work- 
ing memory able to generalize over many tokens or types and incorporate 
them into hierarchical structures (Caplan and Waters, 1999). I am obvi- 
ously not claiming that animals lack working memory — many cognitive 
tasks (such as delayed match to sample tasks (Finch, 1942; Gazes et al., 
2013), or simple object permanence (Pollok et al., 2000)) demonstrate that 
they do, but we are discussing here a particular class of working memory 
that is distinct from these more general abilities (cf. Caplan and Waters, 
1999). Rather this is a specific form of working memory suited to assigning 
structures to input strings (for example, parsing a sentence) or converting 
some semantic structure into a syntactic structure and then outputting it as 
a string. Without such a structural working memory store, it is impossible 
to flexibly build up and manipulate tree-like structures — a crucial require- 
ment of human syntax. 
Because such terms as “structural working memory” are somewhat 
vague, and unambiguous terms like “supra-regularity” require a body of 
technical knowledge to be understandable, I have given this particular de- 
rived subset of human syntactic capabilities a more memorable name: den- 
drophilia. Dendrophilia (Fitch, 2014) is my term for the human proclivity 
to attribute tree-like structures to sensory patterns (composed from the 
Greek roots “dendro”, meaning tree, and “philo-” meaning “loving, fond 
of, tending to”); it is, in short, our “love of trees”. By hypothesis, this 
capability applies across different sensory domains, and is as applicable to visual 
signs as spoken words, and indeed applies to music as well (Fitch, 2014). 
It is this component of complex, structural syntax that I hypothesize to be 
a relatively recent acquisition, post-dating the separation of the hominin 
lineage from that of other apes. Thus, dendrophilia is a core component 
of the human FLD. 


3. Empirically Evaluating the Dendrophilia Hypothesis 


From a mathematical viewpoint it is relatively trivial to create rule systems 
that go beyond the regular level. For example, the simple grammar A^nB^n, 
where n ≥ 1, generates strings in which the number of As matches the number 
of Bs (examples include “AB”, “AABB”, “AAABBB”, etc.). This grammar 
is supra-regular because it requires n to be stored after the evaluation of 
the A string, and then recalled after the Bs have been counted to see if they 
are the same. In this case the only additional form of memory required is 
two simple integer counters, and a phrase comparison process by which 
the A and B components can be recognized and distinguished. Although 
this is in some sense the simplest possible supra-regular grammar, it seems 
to be beyond the capabilities of nonhuman animals. But there are many 
other more interesting (and more linguistically relevant) grammars that 
are supra-regular. These include the “mirror grammar”, which can be 
represented as w w^R, where w is any string and w^R indicates its mirror image, or 
the copy grammar w w. Both of these go beyond the capabilities of regular 
grammars, as do countless other possible rule sets. 
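
The extra memory that each of these grammars demands can be illustrated with three small recognizers. This is a sketch under simplifying assumptions, not a formal automaton: the function names are hypothetical, and the mirror and copy checks use the known string length to find the midpoint, whereas a true pushdown automaton would have to guess it.

```python
from collections import deque

def is_anbn(s):
    """Accept A^nB^n (n >= 1) using two integer counters."""
    n_a = n_b = 0
    i = 0
    while i < len(s) and s[i] == "A":   # count the leading As
        n_a += 1
        i += 1
    while i < len(s) and s[i] == "B":   # then count the Bs
        n_b += 1
        i += 1
    # The whole string must be consumed and the two counts must agree.
    return i == len(s) and n_a >= 1 and n_a == n_b

def is_mirror(s):
    """Accept w w^R (non-empty w) using a stack: last in, first out."""
    if len(s) == 0 or len(s) % 2:
        return False
    half = len(s) // 2
    stack = list(s[:half])              # push the first half
    return all(stack.pop() == symbol for symbol in s[half:])

def is_copy(s):
    """Accept w w (non-empty w) using a queue: first in, first out."""
    if len(s) == 0 or len(s) % 2:
        return False
    half = len(s) // 2
    queue = deque(s[:half])             # enqueue the first half
    return all(queue.popleft() == symbol for symbol in s[half:])

print(is_anbn("AAABBB"), is_anbn("AABBB"))   # True False
print(is_mirror("ABBA"), is_mirror("ABAB"))  # True False
print(is_copy("ABAB"), is_copy("ABBA"))      # True False
```

In each case the recognizer must hold on to information about the first part of the string (a count, or the symbols themselves) until the second part has been read.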

Despite the relative ease of demonstrating mathematically that a gram- 
mar is supra-regular (often using some variant of the “pumping lemma”, 
see Hopcroft et al., 2000; Jäger and Rogers, 2012), it is less trivial to 
demonstrate in a practical empirical experiment that some participant has in 
fact learned a supra-regular grammar. There are two essential empirical 
problems. The first is that mathematical proofs are based on allowing the 
length of strings (or substrings) to go to infinity: otherwise they can be 
captured by a large but finite set of states (amounting to a simple list of 
all strings). In reality of course, both our memories and our lifespans are 
limited (as is the patience of our human or animal participants) and so we 
cannot test them with infinite strings and attain this level of certainty in em- 
pirical experiments. However, what we can do is expose them to a training 
set of some limited-length strings, and then see if and how they generalize 
to longer (or shorter) strings. For example, if exposed to A^nB^n, with n = 2 
or 3 (and strings of length 4 or 6), we can then see how our subjects react 
to A^nB^n, with n = 4 or 5. If they are able to generalize over n in this case, it 
cannot be because they simply memorized the previous patterns, or some 4 
or 6 slot schema that generalizes over them. Rather, they must have inferred 
that there is some variable n which counts the As and Bs and must be the 
same for both sets. Thus, demonstrating generalization over n is crucial for 
demonstrating mastery of the A^nB^n grammar. 
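
The logic of this generalization test can be sketched by comparing two hypothetical learners on the training and test sets just described. Both “learners” are toy functions assumed for illustration, not models of any participant: one stores the training strings as a list, the other applies a rule with a variable n.

```python
def anbn(n):
    """Build the string A^nB^n."""
    return "A" * n + "B" * n

training = {anbn(2), anbn(3)}        # AABB, AAABBB (lengths 4 and 6)
novel = [anbn(4), anbn(5)]           # lengths 8 and 10, never seen in training

def memorizer(s):
    """Accept only strings encountered during training (a finite list)."""
    return s in training

def counting_rule(s):
    """Accept any string made of n As followed by n Bs, for some n >= 1."""
    n = len(s) // 2
    return n >= 1 and s == "A" * n + "B" * n

for s in novel:
    print(s, "memorizer:", memorizer(s), "counting rule:", counting_rule(s))
```

Both learners accept every training string, so training performance alone cannot tell them apart; only the novel string lengths separate them.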

The second empirical problem is more subtle: we need to be sure that 
our subjects have not figured out some regular-level “trick” that provides 
correct answers most of the time, but does not actually encapsulate the 
supra-regular rule. Given that we don’t expect our participants to be per- 
fect — given enough trials, mistakes will be made due to distraction or bore- 
dom, even if participants have indeed inferred the correct grammar — we 
need some way to distinguish above-chance performance due to a regular- 
level grammar from below-perfect performance due to such errors. 

For example, the first study to examine the A^nB^n grammar (Fitch and 
Hauser, 2004) contrasted it with a simple regular grammar (AB)^n, which 
means “repeat AB as many times as you want” (producing strings like “AB”, 
“ABAB”, “ABABAB”, etc). Given these two grammars, and limited n, it is 
easy to come up with a regular rule that discriminates them. For example, 
checking whether there are any “BA” transitions within the string would 
suffice. Indeed, keas (a large-brained New Zealand parrot species) come up 
with precisely this regular rule to accomplish this discrimination. Similarly, 
after training with AABB and AAABBB strings, keas learned either the rule 
“starts with AA” or “ends with BB” (Ravignani et al., 2015). In order to 
exclude this or similar possibilities, it is necessary to test participants with 
various ungrammatical novel patterns, termed “foils”, and show that they 
can exclude them. In the present case, for the AⁿBⁿ grammar, it is enough to 
examine foils where the number of As and Bs differs (a “mismatched foil”). 
Indeed the keas in these experiments failed to reject such strings, showing 
that despite their above-chance performance during training, they failed to 
induce the intended supra-regular rule for these strings. 

In summary, any clear demonstration of any grammar, and supra-regular 
grammars in particular, requires participants to both generalize to a novel n 
(novel string lengths) and to reject myriad foil stimuli that test for various 
simpler regular strategies. 
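
The two requirements can be illustrated with a minimal Python sketch. It is not taken from any of the studies cited; the function names and stimulus lists are assumptions chosen for illustration. It compares the intended supra-regular rule with the three regular shortcuts described above ("no BA transition", "starts with AA", "ends with BB") on training strings, on strings with a novel n, and on mismatched foils.

```python
def is_anbn(s: str) -> bool:
    """Intended rule: n As followed by exactly n Bs (n >= 1)."""
    n = len(s) // 2
    return n >= 1 and s == "A" * n + "B" * n


# Regular (finite-state) shortcuts of the kind discussed in the text.
def no_ba_transition(s: str) -> bool:
    return "BA" not in s


def starts_with_aa(s: str) -> bool:
    return s.startswith("AA")


def ends_with_bb(s: str) -> bool:
    return s.endswith("BB")


RULES = {
    "AnBn (intended rule)": is_anbn,
    "no BA transition": no_ba_transition,
    "starts with AA": starts_with_aa,
    "ends with BB": ends_with_bb,
}

# Hypothetical stimulus sets, chosen only for illustration.
TRAINING = ["AABB", "AAABBB"]                # n = 2, 3
NOVEL_N = ["AAAABBBB", "AAAAABBBBB"]         # n = 4, 5
MISMATCHED_FOILS = ["AABBB", "AAABB", "AAAABB"]


def evaluate(rule) -> dict:
    """Summarize how a rule behaves on each class of test string."""
    return {
        "fits_training": all(rule(s) for s in TRAINING),
        "generalizes_over_n": all(rule(s) for s in NOVEL_N),
        "rejects_foils": not any(rule(s) for s in MISMATCHED_FOILS),
    }


RESULTS = {name: evaluate(rule) for name, rule in RULES.items()}

if __name__ == "__main__":
    for name, outcome in RESULTS.items():
        print(name, outcome)
```

In this sketch every rule fits the training strings and generalizes to longer strings, but only the intended rule rejects the mismatched foils, which is why both tests are needed.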


4. Experimental Tests in Humans and Animals 


Numerous published studies have demonstrated that humans successfully 
acquire supra-regular rules, including all of those mentioned above. Human 
mastery of the AⁿBⁿ grammar has now been demonstrated in many labs, 
including in a few cases with generalization over n and appropriate rejec- 
tion of mismatched foils (Bahlmann et al., 2006; Bahlmann et al., 2008; 
Hochmann et al., 2008; Ravignani et al., 2013; Stobbe et al., 2012); other 
labs have found some mastery of this grammar but lacked these crucial test 
cases (cf. Fitch and Friederici, 2012; Perruchet and Rey, 2005). Human 
participants have also been shown to master mirror and copy grammars, 
which demand more sophisticated supra-regular abilities (de Vries et al., 
2008; Öttl et al., 2015; Uddén et al., 2012). These data combine with other 
studies in the musical domain (Rohrmeier et al., 2012) and a long history of 
computational linguistic studies (Huybregts, 1985; Shieber, 1985; Stabler, 
2004) to confirm that humans have capacities above the regular level, and 
that these can be demonstrated in experimental studies in the laboratory. 
Similarly, as already mentioned above, nonhuman animals of various 
species have been shown to master various finite-state grammars (reviewed 
in ten Cate and Okanoya, 2012). The type of regular rules that animals 
have been shown to learn include transition probabilities (“B follows A”), 
simple string concatenation (e.g. the (AB)ⁿ grammar), “algebraic rules” like 
XXY and simple long distance dependencies (Chen et al., 2015; Murphy et 
al., 2008; Ravignani et al., 2013; Sonnweber et al., 2015; Spierings and ten 
Cate, 2016; Stobbe et al., 2012; Wilson et al., 2013). These data leave little 
doubt that the ability to learn simple regular grammars represents part of the 
FLB which is shared very widely, including with apes, and thus represents 
a component of the CAS_LCA. 

In contrast, there are only three studies in the current literature claiming 
to show that animals can master supra-regular grammars, none of them 
credible. All of them are based on the AⁿBⁿ grammar, which has often 
mistakenly been claimed to provide a test for “recursion” or “center 
embedding” (a misinterpretation of formal language theory, cf. Fitch, 2014; Fitch 
and Friederici, 2012). Of these, a study in baboons by Rey et al. (2012) is 
the weakest, because it did not test for generalization over n (and indeed 
n was limited to 2 in these experiments). In this study 11 baboons were 
trained on a paired-associate task. Each baboon learned to associate a set 
of six arbitrary pairs of meaningless symbols, a1-a2, b1-b2, etc., by pressing 
the symbols in order on a touch screen. After achieving 80% correct on each 
of these single pairs, the baboons were tested with two stimuli: they were 
first shown a1, and then b1. The key choice came next, when the baboons 
were simultaneously presented with both a2 and b2. The baboon had to 
select both of these stimuli to get a reward, with either order rewarded. 

The central finding of this study is that baboons pushed b2 first roughly 
twice as often as a2 (in test2 the bias was weaker, about 3:2). In other 
words, the baboons showed a significant bias to match the most-recently 
presented stimulus (b), rather than the initially presented one, so that the 
pattern ab-ba was significantly (though see below) more frequent than 
ab-ab. This pattern held over individual animals, at least in the first test. 

The authors claimed that this phenomenon demonstrates “recursion” and 
“centre-embedding” in baboons. But in fact this does not even demonstrate 
mastery of AⁿBⁿ, a less-demanding goal, because it lacked the appropriate 
controls (extensions and mismatched foils). The most obvious alternative in- 
terpretation of the presented results is that the baboons show a simple recency 
effect, as is typical in memory experiments. The most recently presented pair 
remains more activated than the initially presented pair, and thus is completed 
first. This is predicted by many models of visual memory, and has no obvious 
relationship to embedding, recursion, or supra-regularity (for a more detailed 
and extensive critique, see Poletiek et al., 2016). 

A second study with Bengalese finches was slightly better designed (Abe 
and Watanabe, 2011), in that the authors did test for generalization over 
n; more interestingly the neural basis of the rule-learning ability was inves- 
tigated. Unfortunately this study did not test with mismatched foils and 
so cannot reject the possibility of a regular “trick” being used to achieve 
success on the task (again, a more detailed critique, and various alternative 
explanations to those given by the authors, are given in Beckers et 
al., 2012). Clearly this study does not convincingly demonstrate 
supra-regularity either. 

The best, and superficially most convincing, attempt to show mastery of 
the AⁿBⁿ grammar by an animal species was a study published in Nature by 
Gentner and colleagues (Gentner et al., 2006). This study is the only appar- 
ently positive result to have a convincing set of control stimuli, and showed 
both generalization and rejection of mismatched foils. This study investi- 
gated eleven starlings (Sturnus vulgaris) in an operant learning paradigm, 
where the A stimuli were starling “rattle” vocalizations and the Bs were 
“warbles”. Starlings are songbirds with a quite complex song, where both 
males and females sing, and continue to learn new songs throughout their 
life. Thus, they seem to be a likely species in which complex rule learning 
might be expected. Training took a huge number of trials (between 9,400 
and 56,000 trials, depending on the bird) and according to the authors 
“was slow by comparison to other operant song-recognition tasks”. After 
training with n = 2 stimuli, nine birds showed successful generalization to 
n = 3 and 4. Furthermore, mismatched foils were tested (e.g. AᵐBⁿ with m ≠ n), 
and were successfully rejected by most of the individual birds (although 
“successfully” here means simply that the birds accepted the correct strings, 
of the form AⁿBⁿ, significantly more often, not that they always rejected 
the mismatched strings). Nonetheless, this result seemed to demonstrate 
that starlings indeed had learned the supra-regular rule intended by the 
experimenters (although again, the authors erroneously claimed that this 
demonstrated recursion in these birds). This would have been a very excit- 
ing result, opening the door to investigations of the neural basis of supra- 
regularity in an animal species. 

Unfortunately this positive conclusion was premature. The bad news 
came from a group of Dutch researchers in Carel ten Cate’s laboratory in 
Leiden, who did an experiment very similar to that of Gentner et al. (2006) 
but now using zebra finches (van Heijningen et al., 2009). Although the 
birds did seem to succeed on the task, using the same criteria and analyses 
as those used in the Gentner study, when birds were analyzed individually 
it was found that the “success” seen at the group level was not mirrored at 
the individual level, and that indeed each individual bird seemed to be us- 
ing a different regular strategy to solve the problem. Only when birds were 
combined, resulting in a fictitious “super-bird” that correctly responded to 
all stimuli, did it appear that the group as a whole could correctly solve the 
task and learn the intended grammar. The same could be true in the starling 
study, since only averaged discrimination scores for all successful starlings 
combined were published for these crucial foils. After this critique appeared 
in a high-profile journal, it was followed immediately by a response letter 
to the editor by Gentner and colleagues, published in the same journal, but 
no reanalysis of the individual starling data has ever been published in the 
subsequent decade. This implies that the starling data too can be explained 
by individual birds adopting simpler regular strategies, which only when 
averaged together yield the appearance of success. 
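
The “super-bird” effect can be illustrated with a toy Python sketch. The three strategies and the stimulus lists below are hypothetical, chosen only to show the logic; they are not the strategies reported for the zebra finches or starlings. Each simulated bird follows a different regular rule and wrongly accepts at least one foil, yet the group average is higher for grammatical strings than for every foil.

```python
from statistics import mean


# Three hypothetical regular strategies, one per simulated bird.
def even_length(s: str) -> bool:
    return len(s) % 2 == 0          # parity needs only two states


def aa_start_and_bb_end(s: str) -> bool:
    return s.startswith("AA") and s.endswith("BB")


def no_ba_transition(s: str) -> bool:
    return "BA" not in s


BIRDS = {
    "bird_1": even_length,
    "bird_2": aa_start_and_bb_end,
    "bird_3": no_ba_transition,
}

# Hypothetical test stimuli.
GRAMMATICAL = ["AABB", "AAABBB"]
FOILS = ["AABBB", "AAABB", "ABAB", "BBAA"]


def group_acceptance(stimulus: str) -> float:
    """Proportion of birds accepting a stimulus (the 'super-bird' score)."""
    return mean([1.0 if rule(stimulus) else 0.0 for rule in BIRDS.values()])


def foils_accepted_by(bird: str) -> list:
    """Foils that a single bird wrongly accepts."""
    return [s for s in FOILS if BIRDS[bird](s)]


if __name__ == "__main__":
    for s in GRAMMATICAL + FOILS:
        print(f"{s:8s} group acceptance = {group_acceptance(s):.2f}")
    for bird in BIRDS:
        print(bird, "accepts foils:", foils_accepted_by(bird))
```

Only the per-bird breakdown reveals that no individual in this toy group has learned the intended grammar, which is why individual-level analysis of foil responses matters.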

Multiple previous and subsequent studies have shown that the AⁿBⁿ 
grammar is not learned by cotton-top tamarins, pigeons, zebra finches or 
keas (Fitch and Hauser, 2004; Ravignani et al., 2013; Stobbe et al., 2012; 
van Heijningen et al., 2009), based on the animals’ failure to reject mis- 
matched foil strings. As already mentioned, keas trained to discriminate 
between AⁿBⁿ and (AB)ⁿ strings were able to successfully generalize, but 
failed to reject mismatched foils. A maximum likelihood analysis (Ravig- 
nani et al., 2015) showed that each individual bird came up with regular, 
bigram-based strategies to solve this task. For example when trained to 
respond positively to AABB and AAABBB strings, four out of five keas 
used a “BB recency” rule (choose stimuli with two Bs at the end), and the 
remaining bird used an “AA primacy” rule (accept stimuli with two As at 
the beginning). While these strategies do allow birds to generalize to new 
stimulus lengths (accepting for example AAAABBBB), they do not allow 
the birds to correctly reject such mismatched stimuli as AABBB or AAABB. 
Furthermore, even with further training and feedback where each trial 
involved a matched and an unmatched stimulus, side-by-side, these highly 
intelligent birds were unable to master this task, which was mastered in 
less than ten minutes of exposure by humans (in comparison to months of 
training for keas or starlings). This task, apparently trivial to humans in 
either the auditory or visual domain, is extremely difficult for other species 
that have been tested. 

In summary, all of the available human and animal data, as of late 2016, 
are consistent with the following proposals: 


1) Both humans and animals are able, using diverse tasks and stimuli, to 
master various regular (finite-state) grammars; 

2) Humans alone are able, again using multiple modalities and grammars, 
to master various supra-regular grammars. 


These data are thus clearly consistent with the dendrophilia hypothesis, 
that humans have both an ability, and a propensity, to find higher-order 
tree-like structures in a set of input strings. Note that any individual human 
could have performed perfectly well, as the keas did, by using a regular 
strategy (like AA primacy or BB recency) to process the AⁿBⁿ grammar: such 
strategies are perfectly compatible with the training data. But, although the 
occasional human subject appears to do so (as revealed by failing to reject 
mismatched stimuli; see Hochmann et al., 2008; Stobbe et al., 2012), the 
vast majority follow a supra-regular strategy nonetheless. The human data 
are thus consistent with the observation of George Miller, who long ago 
initiated the study of artificial grammar learning in his “Grammarama” 
project, that 

“constituent structure languages [supra-regular languages] are more natural, easier 
to cope with, than regular languages... The hierarchical structure of strings generated 
by constituent-structure grammars ... would be easier for people than would 
the left-to-right organization characteristic of strings generated by regular grammars” 
(p. 140, Miller, 1967) 


5. The Neural Basis of Dendrophilia 


I will end with a brief exploration of the neural basis of human dendrophilia. 
What neural changes occurred in our lineage to provide us with the supra-regular 
computational machinery needed to process tree-like structures? 
Several initial remarks are in order. First, whatever circuitry is involved 
must apply to multiple modalities (both visual and auditory) and to both 
input and output (allowing both perception and generation of supra-regular, 
tree-formed stimuli). Second, if supra-regular abilities indeed are based on 
augmentation of “normal” finite-state capabilities by some accessory form 
of memory (counters, stacks, queues or the equivalent), we could expect 
to see a disjunction in the neural processing of regular and supra-regular 
grammars. 
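
The idea of a finite-state system augmented by an accessory memory can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a model of neural processing: the function names are hypothetical, and for the mirror and copy grammars the midpoint is simply taken from the string length, which sidesteps the problem of locating it. A counter suffices for AⁿBⁿ, a stack (last in, first out) for the mirror grammar, and a queue (first in, first out) for the copy grammar.

```python
from collections import deque


def accepts_anbn(s: str) -> bool:
    """Two-state controller plus a single counter."""
    state, count = "reading_A", 0
    for sym in s:
        if sym == "A":
            if state != "reading_A":   # an A after a B is never allowed
                return False
            count += 1
        elif sym == "B":
            if count == 0:             # more Bs than As so far
                return False
            state = "reading_B"
            count -= 1
        else:
            return False
    return state == "reading_B" and count == 0


def accepts_mirror(s: str) -> bool:
    """Mirror grammar (w followed by w reversed), using a stack."""
    if not s or len(s) % 2:
        return False
    half = len(s) // 2
    stack = list(s[:half])             # push the first half
    return all(stack.pop() == sym for sym in s[half:])


def accepts_copy(s: str) -> bool:
    """Copy grammar (w followed by w), using a queue."""
    if not s or len(s) % 2:
        return False
    half = len(s) // 2
    queue = deque(s[:half])            # enqueue the first half
    return all(queue.popleft() == sym for sym in s[half:])


if __name__ == "__main__":
    print(accepts_anbn("AAABBB"), accepts_anbn("AABBB"))
    print(accepts_mirror("ABBA"), accepts_mirror("ABAB"))
    print(accepts_copy("ABAB"), accepts_copy("ABBA"))
```

In each function the control logic stays finite-state; only the added memory structure changes, which is the sense in which the text speaks of augmentation.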

The most obvious place to look for neural circuitry involved in pro- 
cessing complex grammar is the inferior frontal gyrus (IFG), also termed 
Broca’s area (though strictly speaking this term should be reserved for the 
left hemisphere). Paul Broca himself thought of this region as a speech 
output region (Broca, 1861). However, since the seminal work of Zurif and 
Caramazza in the 1970s, it has been known that patients with damage to 
this region also exhibit difficulties in processing sentences with complex 
syntax (Caramazza and Zurif, 1976; Zurif et al., 1972), and the involve- 
ment of the IFG in syntactic tasks is now a very robust result from brain 
imaging studies (Friederici, 2012; Grodzinsky, 2000; Musso et al., 2003; 
Price, 2012; Pulvermüller, 2010; Uddén et al., 2008). One recent result 
that is particularly convincing used a parametric design (which looks for 
correlations rather than using subtraction, and is thus immune to most of 
the critiques leveled at early brain imaging studies) where “chunk size” 
(the number of syntactically connected words in a string) was systemati- 
cally varied from one to twelve (Pallier et al., 2011). Additionally, this 
study contrasted ordinary written French with “jabberwocky,” in which 
nonsense nouns and verbs were used but syntactic structure was preserved. 
The findings were clear: although several regions showed increased activation 
with chunk size (presumably corresponding to semantic rather than 
syntactic binding), only the IFG was increasingly activated as chunk size 
increased even with the jabberwocky speech. Thus there can be little doubt 
that the IFG serves as an important “hub” in the neural network involved 
in processing complex syntax. 

Comparative neuroanatomical research has corroborated and extended 
this hypothesis. In particular, Schenker et al. (2010) used painstaking mi- 
croscopic examination of neuroanatomical slices to compare the size of 
IFG regions, defined by their cytoarchitecture, in humans and chimpanzees. 
They found that these regions, although present in chimpanzees, have been 
greatly expanded in the human brain, with Brodmann’s area (BA) 44 being 
6.6 times and BA 45 6 times larger than the same area in chimpanzees. For 
comparison, the primary visual area V1 of the human brain is just 1.4-1.8 
times larger than in chimpanzees, depending on the hemisphere. Thus, these 
two IFG regions represent the most expanded regions of the human cortex 
that have been reported. 

Several recent studies from Angela Friederici’s lab have further explored 
these issues using the same grammars discussed above that have been tested 
with animals. When contrasting the AⁿBⁿ and (AB)ⁿ grammars, these studies 
found (Friederici et al., 2006) that only the former, supra-regular grammar 
strongly activated the IFG, results confirmed by other work using complex 
grammars done in the same lab (Bahlmann et al., 2006; Bahlmann et al., 
2008). This is consistent with the previous work discussed above. More inter- 
estingly, though, this study examined the connectivity of the areas maximally 
activated by these two grammars as well, using an imaging method known 
as diffusion-tensor imaging (DTI) to track white matter bundles emanating 
from particular regions. They found that the two parts of the frontal cortex 
activated by these two grammars had fundamentally different connectivity 
patterns, with Broca’s area connecting, via dorsal pathways, to the parietal 
and temporal lobes, while the opercular regions associated with the regular 
grammar ran ventrally to the anterior temporal lobe alone. 

Consistent with these human data, a comparative study by Rilling and 
colleagues (Rilling et al., 2008) used DTI to compare the connectivity of the 
IFG in macaques, chimpanzees and humans. They found that this dorsal 
white matter pathway (termed the arcuate fasciculus) has expanded 
in size, and its connectivity has greatly expanded in area, in humans relative 
to these other species. Indeed in the macaque there is very little arcuate 
fasciculus at all (and all fronto-temporal connectivity seems to derive from 
the ventral pathway), while in the chimpanzee brain the dorsal pathway 
exists but ends in the parietal lobe and the posterior-most temporal lobe. In 
humans this pathway forms one of the largest cortico-cortical connections 
after the inter-hemispheric corpus callosum. 

In summary, brain imaging results demonstrate that an important hub 
in the neural network that computes complex syntax in the human brain is 
formed by the IFG regions BA 44 and 45. Anatomically, these regions are 
the most expanded that are known, relative to chimpanzees, on both sides 
but particularly on the left (Schenker et al., 2010) and have also greatly 
expanded their connectivity via the dorsal arcuate fasciculus, relative to 
other primates (Rilling et al., 2008). This confluence of results strongly 
suggests that an expansion of the neural hub centered on the IFG was a 
crucial occurrence during hominin evolution, and represents a key prereq- 
uisite of human syntactic abilities, and supra-regular abilities in particular. 
It further suggests the tantalizing hypothesis that the IFG itself acts as the 
general-purpose memory system that is a computational requirement for 
supra-regular abilities, serving as the “stack” or “queue” that stores intermediate 
results from a more distributed finite-state system (Fitch, 2014; 
Fitch and Friederici, 2012). 


6. Implications and Conclusions 


I have argued that a key component of the FLD is our derived capacity for 
complex syntax, specifically the supra-regular abilities that support human 
dendrophilia. In particular, I suggest that we can partition human syntactic 
abilities into two components. The first is a finite state system, with a large 
long-term memory capacity that can store tokens and basic sequencing op- 
erations over tokens, or in some cases classes of tokens (e.g. combinations 
like bigrams or transitions from one token to the next). These are precisely 
the types of computational capacities that are required for human phonol- 
ogy (Heinz and Idsardi, 2013), itself a complex and rule-governed aspect of 
human language (Fitch, 2010; Heinz and Idsardi, 2013). The comparative 
data suggest that certain computations at this level of the formal language 
hierarchy are widely shared with other species. Although chimpanzee data 
remain scarce, they support the idea that regular-level processing is part of 
our CAS_LCA, that is, part of our shared ape heritage (Sonnweber et al., 2015). Precisely 
which operations are shared, and how broadly, remains a topic for further 
research. Many regular-level phenomena are much more complex than the 
few regular grammars that have been tested so far in animals (Jäger and 
Rogers, 2012; Pullum and Rogers, 2006). Furthermore, most phenomena 
of human phonology have never been examined in animal models (Yip, 
2006), so there could be uninvestigated aspects of phonology that turn out 
to be uniquely human (Pinker and Jackendoff, 2005). 

The second component of human syntax is a supra-regular ability, and it 
is this alone that I hypothesize to be derived in the human lineage and thus 
part of FLD. These supra-regular computational abilities enable us to pro- 
cess multiple types of regularities and structures that would be inaccessible 
to purely finite-state systems, including such features as symmetry (mirror 
grammars) or arbitrary repetitions of novel stimuli (copy grammars), as 
well as long-distance phrasal correspondences and multiple levels of hier- 
archical embedding that are central to linguistic syntax. This supra-regular 
capability can be clearly characterized using formal language theory, and 
its availability to humans has been clearly demonstrated in multiple sensory 
domains (vision and audition) in many different laboratories. 

Furthermore, I reviewed evidence supporting the neural hypothesis that 
these abilities are subserved by a network centered on the IFG. If this hy- 
pothesis is correct, one prediction is that the small and weakly connected 
IFG of chimpanzees can serve as a working memory for certain tasks, but 
not others, and that the key event in hominin evolution was an increase in 
the power and scope of this working memory. This suggests that brain imaging experiments 
in chimpanzees might reveal what sorts of tasks activate the IFG in that 
species, and provide an indication of what types of computations might 
have served as precursors of complex syntax in early language evolution. 
Unfortunately, brain imaging in awake chimpanzees, especially performing 
a trained task, is anything but easy (Rilling et al., 2007), and we may be 
waiting for a long time to see results that clearly show IFG-specific acti- 
vation (as opposed to large swathes of the brain, as in Taglialatela et al., 
2008). Nonetheless, this prediction illustrates that we do not need to think 
of language evolution research as indulging in untestable “fairy tales” — us- 
ing the comparative method and modern neuroscientific methods we can 
generate and test specific concrete hypotheses. 

Why would dendrophilia have evolved? The capacity to infer and ma- 
nipulate tree structures provides many potential benefits (and indeed, any 
computer science textbook offers dozens of useful algorithms that rely upon 
tree representations, e.g. Skiena, 2008). These include sorting, planning, 
strategizing and generalizing to unobserved data. In particular, the ability to 
manipulate trees gives its possessor a powerful tool to structure thought, to 
explore novel combinations of items, and thus to construct novel hypoth- 
eses about the world. Such a powerful and abstract ability could apply in 
many domains, from social cognition and reasoning about others, to tool 
construction and use, to complex motor planning. I think that the applications 
of dendrophilia manifestly extend beyond language to include 
music and the visual arts (e.g. architecture and ornament), and that dendrophilia 
thus represents a key component of the human “sense of order” (Gombrich, 1979) 
that enables us to structure input from the world and imagine unseen entities 
(whether ghosts or electrons), and is central not only to language but to all 
of human thought. Thus I think that it would not be particularly fruitful to 
ask “which use came first” or “which benefit is most important” — rather I 
suggest that dendrophilia in general is one of the core components of our 
mind that makes us human. 


Abstract: Why is Homo sapiens the sole animal species communicating 
with a language (i.e., a human language)? Theorists of language evolution have 
mostly adopted a human-centered approach to address this question. This chapter 
discusses the limits of this approach and proposes an alternative that consists in 
studying the domain general functions that serve language comprehension and 
production from a comparative and evolutionary perspective. Special attention is 
given to domain general processes which allow humans and animals to integrate 
information in space and time, and thus develop perceptual and more conceptual 
abstract categories. This chapter presents illustrative studies that reveal the various 
aspects in which these integration processes differ in human and nonhuman ani- 
mals. Finally, we discuss the source of these species differences and their potential 
implications for our understanding of language evolution. 


Keywords: Animal communication, integration processes, categorization, language 
evolution, baboons 


1. Introduction 
1.1 Limits of strictly human-centered approaches 


In the various scientific disciplines interested in language evolution — anato- 
my, physiology, paleoanthropology, linguistics, psychology, neurosciences, 
computer sciences — the precisely defined architecture of human language 
is classically used as the basic reference, the canon, to which all nonhuman 
communication systems are compared. Said differently, the communication 
system of Homo sapiens, an approximately 200,000-year-old isolated species 
(there are no other living species in the genus Homo) with almost 1.5 kg 
of brain tissue, living in complex societies with a long history of cultural 
evolution, is used as the unique reference system in most comparative studies 
on language evolution (for a review, Hauser et al., 2002). Obviously, this 
makes sense given that the purpose of these comparative studies is not to 
build a descriptive catalogue of other animal communication systems but 
to better understand human language evolution. However, when pushed 
too far, such an anthropomorphic approach can be misleading because, by 
definition, nonhuman animals cannot equal human animals in performance 
when it comes to human language. A common pitfall in human-centered 
comparisons consists in implicitly assuming that nonhuman cognitive 
architectures must resemble human cognitive architecture, in parts or as 
a whole. However, such an assumption can hold only if: 1) human and 
nonhuman cognitive architectures followed similar evolutionary paths 
and were adapted to comparable environmental, social and biological con- 
straints, and 2) the cognitive architecture of each species was a construction 
made of independent (non-interacting) cognitive components that are not 
sensitive to developmental and phylogenetic factors. Given that every species 
has a unique cognitive architecture, it seems like a vain enterprise to search 
for strictly identical components in humans and nonhuman animals. 
Consider one example: syntax. Different types of syntaxes have been 
formalized in Chomsky’s hierarchy, from simple (finite state) grammars 
to complex (supra-regular) grammars. In an attempt to better understand 
human cognitive “uniqueness”, major efforts have been put into the investigation 
of nonhuman species’ ability to process supra-regular grammars 
(e.g., Fitch and Hauser, 2004; Gentner et al., 2006; Abe and Watanabe, 
2011). Unsurprisingly, nonhuman animals do not equal Homo sapiens in 
that particular type of linguistic computation. The very attempt to search 
for strict human-like syntax in nonhuman animals implies that there is only 
one way to compute information in a complex communication system: the 
human way. This view does not consider the possibility that each species, 
even phylogenetically close to Homo sapiens, might have developed its own 
original and complex (possibly multimodal) cognitive architecture, which 
does not include human-like syntax. It also neglects the existence of inter- 
actions between the various components of a cognitive architecture whose 
effects increase over phylogenetic and developmental time scales (syntax 
might not exist as an independent computational subsystem, not even in 
humans, see Seidenberg and MacDonald, 1999). These interactions make 
it very unlikely that complex integrated levels of computation (syntax-like 
processes) in the cognitive architectures of two different species resemble 
each other. 

We must acknowledge, however, that more and more studies are look- 
ing for simpler forms of syntax in nonhuman animals. This approach has 
recently gained interest in the field of ethology and to some extent in lin- 
guistics (for a review, Schlenker et al., 2016; see also Petkov and Wilson, 
2012). It rests on the assumption that qualitatively comparable basic learning 
principles, which allow the extraction of combinatorial semantics, statistical 
regularities, or adjacent and nonadjacent dependencies, hold for sequence 
learning in human as well as nonhuman communication (e.g., Seidenberg 
et al., 2002). 

The narrow anthropomorphic approach to language evolution that we 
question here is not only found in linguistics but has dominated other 
domains as well, such as comparative anatomy. Rapid progress of imag- 
ing techniques over the past 20 years has spurred a frantic search for the 
anatomical landmark of language in Homo sapiens. After a couple of un- 
successful attempts to teach human vocal language to nonhuman primates 
(e.g., Hayes, 1951), it has become clear that our closest cousins — chim- 
panzees — are not able to pronounce human phonemes. The first proposed 
explanation for the incapacity to produce speech was that their larynx was 
too high (e.g., Lieberman, 1968, 1975). The existence of a physical limitation 
on speech in chimps was accepted in the research community 
for many years. However, recent studies suggest that vocal tract anatomy 
cannot suffice to explain the absence of speech in nonhuman mammals 
(for a review, Fitch 2010). Fitch et al. (2016) inferred from X-ray videos 
and a modeling approach that the macaque vocal tract has the potential to produce 
a broad range of speech sounds. This was confirmed in one of our recent 
studies, in which we recorded the spontaneous vocalizations of baboons (Papio 
papio; Boé et al., 2017). We found that the baboons produce sounds that 
share the acoustic F1/F2 formant properties of human [i æ a o u] vowels, 
and those baboon sounds were produced by movements of the tongue in a 
human-like articulatory space defined by two axes (anterior-posterior and 
superior-inferior, see Boé et al., 2017 for more details). Therefore, the in- 
ability of nonhuman primates to produce human phonemes is more likely 
due to differences in the neural circuits that command oro-facial muscles, 
or to other cognitive differences, than to differences in the anatomy of the 
vocal tracts. 

Other anatomical factors have been proposed to explain the uniqueness 
of language in the human species. Most of them concern human-centered 
brain features. Neuroscientists have been looking for the homologues 
of Broca’s area and Wernicke’s area in the brains of various apes. 
Broca’s and Wernicke’s areas are important regions of the human brain whose injury 
provokes severe language disorders, the so-called Broca’s and Wernicke’s 
aphasias. Because of interspecific anatomical differences, direct comparisons 
of brain regions across species are difficult to make. Instead, researchers 
have compared brain asymmetries across species. In the human brain, the 
language function is mostly hosted in the left hemisphere. Broca’s area, 
Wernicke’s area and the planum temporale are bigger in the left than in the 
right hemisphere (Geschwind and Levitsky, 1968). Those asymmetries, that 
were first thought to be unique to humans, have been observed as well in 
nonhuman primates (e.g., Cantalupo and Hopkins, 2001; Cantalupo et al., 
2003). In a recent paper, Gomez-Robles and collaborators (2013) propose 
that, at a whole brain scale, there is continuity in asymmetric variation 
between humans and chimpanzees: similar brain asymmetries exist in both 
species, even though the human brain tends to be a little more asymmetric 
and more sensitive to developmental constraints. 

More recently, tremendous efforts have been put into the identification of 
precise brain regions or neural structures that might be unique to humans 
and explain the emergence of language in our species. For example, Leroy 
et al. (2014) have proposed that the superior temporal sulcus (STS) critically 
differs in human and nonhuman species: human STS is deeper in the right 
hemisphere compared to the left, and this depth asymmetry is not found 
in chimps. Given that STS is central in the perisylvian language region, 
this particular landmark could be a promising “human-only” candidate. 
However, as the authors acknowledge, the link between this anatomical 
feature and the language ability in humans remains loose: MRI anatomical 
measurements made on various groups of human subjects show that the STS 
depth asymmetry is bigger in men than in women, it persists in children with 
impaired language development, and is unchanged in adults with reversed 
language lateralization (situs inversus). To drive the point home, an even
more recent study shows that the very same STS brain asymmetry exists in
baboons (Meguerditchian et al., 2016). Further investigations on the
relationship between variation in human brain anatomy (at a brain network
scale) and variation in human language function are needed before one can 
make between-species comparisons and draw convincing conclusions about 
the role of precise human brain features in the evolution of human language. 

The literature on human brain landmarks for language also raises a
crucial question: what can gross anatomy tell us about fine-grained and
complex interacting cognitive functions, in particular in a comparative
perspective? Without a well-defined theoretical model of the
anatomy-to-function relationship in each species that would make the
comparison possible, the explanatory power of anatomical brain differences
remains very limited. This would not be the case if we knew precisely the functional
significance of those particular brain regions in their neural networks, in 
both species. For example, recent studies have shown that the homologue of
Broca’s area in chimps is involved in communicative behaviors (Taglialatela
et al., 2008; 2011); however, little is known about the type of computation
it performs, about the way it deals with a combination of communicative
gestures and vocalizations, or about the way this region interacts with
deeper structures involved in emotional vocalizations (Jürgens, 1979). Just
like the linguistic/cognitive human-centered components, the anatomical
landmarks of language, when strictly human-inspired (and functionally
underspecified), seem of very little explanatory value for understanding
language evolution in the human lineage, at least in the current state of our
scientific knowledge.

The search for a unique (cognitive or anatomical) key factor at the origin 
of human language will inevitably lead to a stalemate. Brains, just like the
cognitive functions they host, are shaped by species-specific phylogenetic
and developmental trajectories. They are complex systems, and the
differences between human and nonhuman brains and functions cannot be
easily reduced to single “key” components, especially when a strict
human-centered definition of these components is applied.


1.2 An alternative ...


What could be the alternative to this narrow human-centered approach? 
In this chapter, we propose to step back and broaden our view of language
evolution. Rather than focusing on the presence or absence of strictly
human-inspired language features in nonhuman animals, we believe that an 
alternative is to examine the background of the language function, namely 
the inherited domain-general elements of “the machinery required to mas- 
ter human language” (Saffran and Thiessen, 2008), that might be used 
in nonhuman species for communicative and/or other purposes. Domain- 
general mechanisms correspond to learning devices that apply to a variety 
of different cognitive functions, as opposed to domain-specific mechanisms 
that are dedicated to specific cognitive functions. The opposition between 
domain-general versus domain-specific mechanisms was first proposed in 
a cognitive development framework where Skinner’s view (1957) was op- 
posed to that of Chomsky (1959). In the present chapter, we use the notion 
of domain-general mechanism in a more evolutionary perspective. The
idea we defend here is that the human language device is mainly made
of domain-general elements, some of which are shared with other species,
either because they were present in a common ancestor or because they result
from convergence processes. These inherited cognitive components are very
likely to take different forms in different species, as a function of the cogni- 
tive domain they are involved in, because different domains show different 
regularities and constraints that shape these components. However, we 
expect that close comparison between human and nonhuman performance 
will reveal what aspects of those domain-general cognitive components are 
shared across species. The underlying hypothesis we uphold here is that 
complex and phylogenetically recent cognitive functions, like language, are 
probably the product of intense re-use and re-combination of subsets of
inherited anatomical, cognitive, and behavioral components (Anderson, 2010).
Phylogenetically close species might share some (but not all) of these
components, as a support for communication and/or other cognitive functions.
For example, the serial organization and structuring of elements that we
find in the processing of syntax might as well serve the planning of complex
motor sequences in humans (Koechlin and Jubault, 2006), in other primates
or in birds, including the sequences of birdsong (Suzuki et al., 2016).

In the remaining part of this chapter, we will first describe some of the
multilevel integrative processes that are critically involved in human lan- 
guage, which likely evolved from domain-general inherited functions. We will 
then consider and discuss the main findings of the literature on comparative 
cognition regarding these domain-general mechanisms. In this context, we
will pay special attention to the mechanisms by which animals integrate 
stimulus information at various levels, from low level perceptual grouping 
mechanisms that lead to global percepts, to more cognitively complex pro- 
cessing mechanisms that make it possible to associate meanings to objects or 
categories for example. Importantly, this chapter is not aimed at making an 
exhaustive review of the literature regarding these cognitive processes. Our 
goal is to document potential species differences and similarities concerning 
these processes, and to illustrate these differences and similarities with a
selection of suggestive findings. In a final section of this chapter, we will discuss the
potential impact of these results on our understanding of language evolution. 


2. Integration processes in nonhuman animals 
2.1 Integration in time and space 


Language production and comprehension involve the processing of a continu- 
ous flow of information, and a temporal integration of the different linguistic 
elements. There is an obvious connection between sequential learning and 
language, because these two cognitive processes require the extraction and 
further handling of elements occurring in temporal sequences. However, be- 
cause most behaviors have a temporal structure, the capacity to relate events 
in the time dimension might derive from a domain-general function that 
is involved in many different contexts, beyond language. We will consider
below four different aspects of sequence processing, namely the ability
(1) to remember a sequence of events and to process that sequence as
temporally ordered, (2) to learn the transitional probabilities of items
occurring in sequence, (3) to learn and process nonadjacent dependencies, and
therefore to know that event A is followed by event B with intervening events
between them, and (4) to gain information regarding the general structure of
the sequence (i.e., syntax, in the case of human language). For each of these
abilities, we document below which aspects of sequence processing seem to be
shared by animals and humans, and which aspects seem more restricted to humans.


2.1.1 Serial list learning


List learning corresponds to the capacity to remember items in a list, and 
to remember their temporal ordering. Nonhuman animals can learn and
remember lists of items (e.g., Terrace et al., 2003). Comparative
investigations of serial list learning in animals, especially pigeons and monkeys,
suggest that the memory of lists of items is not qualitatively different from
what is found in humans. When taught that A<B, B<C, C<D, D<E, various
nonhuman species have been able to properly order previously untrained
test pairs such as C<E. Pigeons and monkeys show serial position effects
(Wright et al., 1985) sharing similarities with those of humans. Monkeys 
also show the symbolic distance effects found in humans: their latency to 
compare two items is shorter when the items are far apart in the list (e.g., 
AC vs CD; Colombo and Frost, 2001). All these discoveries show that 
animals are capable of developing representations of series of items based 
on their ordinal position. Although the succession of words in a sentence 
is not strictly equivalent to a succession of unrelated items, serial list learn- 
ing capacity is probably necessary in both cases. These results encourage 
the view that the underlying features of the learned representations are 
shared by human and nonhuman animals, suggesting evolutionarily old
mechanisms for item list learning.
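
The ordinal account described above can be illustrated with a minimal Python
sketch. It assumes, purely for illustration, that the trained premise pairs
are chained into a single ranked list. The function names (order_from_premises,
lower_item, symbolic_distance) are hypothetical and are not taken from the
cited studies, and the sketch makes no claim about the mechanism that animals
actually use.

```python
def order_from_premises(premises):
    """Chain adjacent premise pairs (x, y), read as x < y, into one ordered list."""
    successor = dict(premises)          # x -> the item trained as just above x
    uppers = set(successor.values())
    # The lowest item is the only one never seen on the "greater" side of a pair.
    start = next(x for x in successor if x not in uppers)
    ordered = [start]
    while ordered[-1] in successor:
        ordered.append(successor[ordered[-1]])
    return ordered


# Premise pairs as in the text: A<B, B<C, C<D, D<E.
premises = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
ranks = {item: i for i, item in enumerate(order_from_premises(premises))}


def lower_item(x, y):
    """Return the item with the lower ordinal rank, even for untrained pairs."""
    return x if ranks[x] < ranks[y] else y


def symbolic_distance(x, y):
    """Number of ordinal steps separating two items in the list."""
    return abs(ranks[x] - ranks[y])


print(lower_item("C", "E"))           # untrained test pair
print(symbolic_distance("A", "C"))    # items far apart
print(symbolic_distance("C", "D"))    # adjacent items
```

Under this toy representation, the untrained pair C/E is ordered correctly,
and the A/C pair has a larger symbolic distance than the C/D pair, which is
the comparison for which shorter latencies are reported.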


2.1.2 Processing of adjacent dependencies and chunking 


Adjacent dependencies refer to a predictive relationship between one event 
and the event immediately following it in the sequence. Consider two
three-item sequences, A-B-C and A-B-D, which are presented an
equal number of times. In this very small corpus composed of two
sequences, A is always followed by B, and B is followed half of the time by C
and half of the time by D. The transitional probability between A and B is
thus equal to 1, while the transitional probability between either B and C or
B and D is equal to .5. Sensitivity to transitional probabilities is one of the
mechanisms promoting the learning of auditory and visual sequences in
humans (e.g., Hunt and Aslin, 2001). It is one of the mechanisms by which
children learn word boundaries and segment speech streams into words: in
a sentence, the transitional probability between the last syllable of a word
and the first syllable of the following word is lower than the probability
between two successive syllables within one word (Saffran et al., 1996).
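
The arithmetic of the A-B-C / A-B-D example can be checked with a short
Python sketch. It assumes the usual definition of a transitional probability,
namely the number of times X is immediately followed by Y divided by the
number of times X is followed by any item. The function name and the corpus
size (ten presentations of each sequence) are illustrative choices.

```python
from collections import Counter


def transitional_probabilities(sequences):
    """Return P(next | current) for every adjacent pair observed in the corpus."""
    pair_counts = Counter()
    first_counts = Counter()   # occurrences of an item that are followed by another item
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            pair_counts[(current, nxt)] += 1
            first_counts[current] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# Two three-item sequences presented an equal number of times.
corpus = [("A", "B", "C"), ("A", "B", "D")] * 10
tp = transitional_probabilities(corpus)

for pair in sorted(tp):
    print(pair, tp[pair])
```

With this corpus, the A-B transition has a probability of 1, and the B-C and
B-D transitions each have a probability of .5, as stated in the text.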

Comparative studies have shown that a high transitional probability
between two items A and B facilitates the processing of the second item (B) in
pigeons and monkeys. Froehlich et al. (2004), for instance, tested pigeons
in a serial response time task requiring them to peck a stimulus appearing
sequentially at three possible locations and in a predefined order. The
transitional probabilities between the stimulus locations were controlled in this
task, and the authors report that response time to peck is a direct func- 
tion of these probabilities. Thus, high transitional probabilities gave rise 
to short response times at the second location of the considered pair, while 
low transitional probabilities gave rise to longer response times. These au- 
thors directly compared their results on pigeons to Hunt and Aslin’s study 
conducted on humans (2001), and report that although slower, pigeons 
processed information at roughly the same rate as humans, as reflected in 
similar overall regression slopes (Figure 1). 


Figure 1: Average response times of pigeons and humans depending on transitional 
predictability of the items in a sequence. Results on pigeons are from Froehlich 
et al. (2004), those on humans are from Hunt and Aslin (2001). Also shown 
are linear regression lines for the two sets of data. Pigeons responded more
slowly but processed information at roughly the same rate, as reflected in
similar overall slopes. Figure adapted from Froehlich et al. (2004).

In humans, the processing of transitional probabilities between adjacent
elements comes with another type of processing known as chunking that 
consists in clustering information within sequences (e.g., the syllables in 
a word). Chunking in humans occurs during the memorization of verbal 
items, but chunking is not restricted to verbal material, and can be found as 
well in the visual domain (e.g., Orban et al., 2008). Experimental evidence 
suggests that the capacity to organize sequences in chunks is also in the 
scope of numerous animal species (rats: Fountain, 1990; pigeons: Terrace, 
1991; tamarins: Hauser et al., 2001; baboons: Minier et al., 2016). In a
recent unpublished experiment involving a serial response time task, we found,
for example, that baboons organized 9-item sequences in three chunks of
three items each, and that these chunks precisely included the items
sharing the highest transitional probabilities (Minier et al., 2016). Therefore,
it seems that human and nonhuman animals are prone to statistical
learning, making use of transitional probabilities both to segment streams of
(visual or auditory) information and to organize the elements composing these
streams into chunks.
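
The link between transitional probabilities and chunking can be sketched in
Python as follows. This is a toy illustration, not the analysis used in the
cited experiments: the three trisyllabic "words", the order in which they are
concatenated, and the 0.75 cut-off are all assumptions made for the example.
A chunk boundary is placed wherever the transitional probability between two
successive items falls below the cut-off.

```python
from collections import Counter


def adjacent_tp(stream):
    """P(next | current) for each adjacent pair in a single continuous stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])   # the final item has no successor
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment(stream, threshold):
    """Open a new chunk wherever the transitional probability dips below threshold."""
    tp = adjacent_tp(stream)
    chunks = [[stream[0]]]
    for prev, item in zip(stream, stream[1:]):
        if tp[(prev, item)] < threshold:
            chunks.append([])
        chunks[-1].append(item)
    return chunks


# Hypothetical stimuli: three made-up words, concatenated without pauses.
words = [("tu", "pi", "ro"), ("go", "la", "bu"), ("bi", "da", "ku")]
order = [0, 1, 2, 0, 2, 1, 0, 1, 1, 2, 2, 0, 0, 2, 1]
stream = [syllable for i in order for syllable in words[i]]

chunks = segment(stream, threshold=0.75)
print(["".join(chunk) for chunk in chunks])
```

In this stream, each syllable belongs to one word only, so within-word
transitions have a probability of 1, whereas transitions across word
boundaries never exceed .5. The recovered chunks therefore coincide with the
words. With noisier input, the result would depend on the chosen cut-off.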


2.1.3 Nonadjacent dependencies 


Equally important for language is the capacity to detect and learn non- 
adjacent dependencies. Consider the following sequence structure A-X-B: 
A is followed by a variable item X, and item X is systematically followed 
by B. Given this structure, there is a nonadjacent transitional probability 
of 1 between A and B. In language, for instance, a listener has to detect the relation between
the subject and the verb in a sentence despite the presence of intervening 
words such as adverbs. This capacity can also be useful in very different 
nonlinguistic contexts, for example when we have to detect a systematic 
relation between two events separated in time (e.g., the ring of the doorbell 
signaling that someone is coming), and irrespective of the intervening other 
events. Experiments have shown that humans and animals can both process
nonadjacent dependencies in temporal sequences of events, although
nonadjacent dependencies are more difficult to detect and learn than adjacent
dependencies (humans: Newport and Aslin, 2004; tamarins: Newport et al.,
2004; rats: Fountain and Benson, 2006). Moreover, there are similarities
between species regarding the factors that affect the learning of nonadjacent
dependencies (e.g., facilitatory effect of perceptual similarity of the
nonadjacent elements in both humans and monkeys; humans: Creel et al., 2004;
Gebhart et al., 2009; squirrel monkeys: Ravignani et al., 2013).
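
The A-X-B structure can be made concrete with a small Python sketch that
computes transitional probabilities at a given lag (lag 1 for adjacent items,
lag 2 for items separated by one intervening element). The function name
lagged_tp and the four filler items X1 to X4 are hypothetical choices made
for this illustration.

```python
from collections import Counter


def lagged_tp(sequences, lag):
    """P(item at position i + lag | item at position i), pooled over sequences."""
    pair_counts = Counter()
    first_counts = Counter()
    for seq in sequences:
        for first, later in zip(seq, seq[lag:]):
            pair_counts[(first, later)] += 1
            first_counts[first] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


# A is followed by a variable item X, which is always followed by B.
fillers = ["X1", "X2", "X3", "X4"]
corpus = [("A", x, "B") for x in fillers]

adjacent = lagged_tp(corpus, lag=1)
nonadjacent = lagged_tp(corpus, lag=2)

print(adjacent[("A", "X1")])       # A predicts any given filler only weakly
print(nonadjacent[("A", "B")])     # A predicts B perfectly across the filler
```

In this toy corpus, a learner tracking only the adjacent transitions out of A
sees four equally likely continuations, while the dependency between A and B
is fully predictable only at lag 2.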

However, animals may have more difficulties when presented with com- 
plex sequences. Wilson et al. (2015) tested 2 monkeys and 33 humans using 
an auditory artificial grammar containing both adjacent and nonadjacent 
(long-distance) relationships. After an initial exposure to the sequences, the 
subjects from the two species were exposed to sequences containing viola- 
tions of either the adjacent or both adjacent and nonadjacent relationships. 
Both species showed sensitivity to adjacent transitions, but only humans,
and only roughly half of them, showed significant sensitivity to nonadjacent
dependencies. Wilson et al. (2015) concluded that in some conditions,
nonadjacent dependencies are less salient in macaques than in humans.
Although replications and extensions are required, this study suggests that, 
compared to monkeys, humans have a greater facility to deal with several 
dependencies of different types (i.e., both adjacent and nonadjacent) at the 
same time. 


2.1.4 Learning of sequence structure 


Learning the structure of a sequence requires the extraction of the rela- 
tionships between the constitutive elements of that sequence. This kind of 
learning probably supports, among others, the encoding of grammatical 
and syntactic linguistic regularities (e.g., in most German sentences, the 
verb is at the end). However, such structural regularities also exist in many 
other (non-linguistic) domains, such as the motor domain. For instance, 
Byrne et al. (2001) reported that the preparation of food items requiring 
complex manipulations (thistle leaves) in wild gorillas follows a hierarchical 
sequential organization. 

Marcus et al. (1999) have shown that 7-month-old infants can quickly 
learn that sequences of auditory stimuli follow an ABB or an ABA structure. 
The processing of such structures was also studied in nonhuman animals,
but the results were quite inconclusive: zebra finches (Heijningen et al.,
2012), rats (Toro and Trobalón, 2005) and even rhesus macaques (Procyk
et al., 2000) do not seem to detect the difference between the ABA and ABB
structures. One study compared zebra finches and humans using the same
experimental procedure and stimuli (Chen et al., 2015): zebra finches did
not learn these two types of structures, while humans learned them readily.
There is to our knowledge only one study in which an animal species could
successfully learn sequence patterns of the ABA/ABB type. This study by
Spierings and Ten Cate (2016) compared two avian species, budgerigars
and zebra finches, and obtained positive results in the former. From this set
of experiments, we can conclude that learning the structure of sequences
might very well be within the scope of some nonhuman species, but this ability
is clearly not as developed as it is in humans.
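
What it means to learn an ABA or ABB structure, as opposed to memorizing
particular items, can be sketched in Python. The function below recodes any
sequence by the order in which its items first appear, so that sequences built
from entirely new items still map onto the same abstract pattern. The function
name and the syllables are hypothetical, and the sketch describes the
computational problem rather than any proposed neural or cognitive mechanism.

```python
def abstract_pattern(sequence):
    """Recode a sequence by order of first appearance, e.g. x y x -> 'ABA'."""
    labels = {}
    pattern = []
    for item in sequence:
        if item not in labels:
            # Assign the next unused letter (sufficient for up to 26 distinct items).
            labels[item] = chr(ord("A") + len(labels))
        pattern.append(labels[item])
    return "".join(pattern)


# Hypothetical training and test items: the test items share no syllable
# with the training items, so only the structure can support generalization.
training = [("ga", "ti", "ga"), ("li", "na", "li")]
test_same = ("wo", "fe", "wo")
test_other = ("wo", "fe", "fe")

learned = {abstract_pattern(seq) for seq in training}
print(abstract_pattern(test_same) in learned)
print(abstract_pattern(test_other) in learned)
```

A learner that has extracted the structure accepts the first test item and
rejects the second, even though both are made of syllables never encountered
during training.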


2.1.5 Integration in space


We documented above that human language function makes great use of 
temporal information, but spatial information is also crucially important 
for language. Infants at a very young age learn that words and
communicative gestures refer to entities in their immediate surrounding space.
Words also make it possible to refer to objects that are spatially (or
temporally) absent (the displacement feature proposed by the linguist Charles
Hockett in 1960 to characterize human language). Finally, nonverbal forms of
language, such as sign language or writing gestures, strongly rely on an
encoding of spatial information. There is no doubt that being able to integrate
and combine information in the spatial domain is also a domain-general
function important for a complex use of language.

In the late seventies, Navon (1977) showed that human subjects
tend to process the global shape of visual objects before they process
their constitutive (local) details. This effect has been named the “global
precedence effect” and is often considered to be an attentional
phenomenon. Global precedence in humans was demonstrated in experimental
research using large letters (global shape) made of smaller letters (local 
features). The degree to which animals perceive the global properties of 
the visual input in comparison to more featural ones has been an issue 
in animal cognition for some time. In our laboratory, we explored this 
effect in baboons (Deruelle and Fagot, 1998; Fagot and Deruelle 1997) 
using large shapes (square, circle, cross) made of smaller shapes (again 
square, circle, and cross) as stimuli. In our tests, the baboons were
required to either match (Fagot and Deruelle, 1997) or identify (Deruelle
and Fagot, 1998) these stimuli considering either their global or local
structure. For comparative purposes, humans were also tested in the same
experimental conditions. 


Figure 2: Processing of the global/local stimulus structure in humans and baboons. 
Left: Illustration of the stimuli used with humans and baboons in Fagot 
and Deruelle (1997). This experiment required subjects to match hierarchical
stimuli considering their global or local structure (in this local trial, they
have to match the circle made of circles with the square made of circles,
considering their common local features). Right: percentage correct
obtained in humans and baboons in global and local trials. Humans
showed an advantage in processing the global structure of the stimuli, while
baboons showed a local advantage. This local advantage in baboons is
accounted for by a general difficulty in “grouping” the local elements of the
hierarchical global/local stimuli into a global whole. Humans are much
less sensitive than monkeys to the spatial distance separating the local
elements (e.g., Deruelle and Fagot, 1998). Figure adapted from Fagot
and Deruelle (1997).

These experiments revealed striking species differences in global-local
processing (Figure 2). Human participants exhibited the global advantage 
already found by Navon (1977), whereas baboons demonstrated their best 
performance and fastest response times in the local condition. Several ex- 
periments were conducted to understand the cause of this human-baboon 
difference, which suggested that the performance of the baboons strongly 
depended on the distance separating the local elements (Deruelle and Fagot, 
1998): when the distance was enlarged, the strength of the local bias
increased, and this effect was amplified in baboons compared to humans. This
effect has been replicated many times in pigeons, capuchin monkeys and
chimpanzees (for reviews, see Fagot and Barbet, 2006; Fagot and Parron, 
2012). 

At first glance, one might consider local precedence in animals as a purely 
perceptual/attentional phenomenon, but we propose that it is more than 
this. One of the main properties of language is that it can convey informa- 
tion about things that are not immediately present (spatially or temporally; 
Hockett, 1960). This displacement feature is crucial in the comparison 
between human language and other forms of primate communication. We 
have previously proposed (Fagot and Barbet, 2006) that a strong local bias 
limits the processing of the relation between and among objects. This effect 
is for instance demonstrated in Fagot and Parron (2010), showing that an
increase in the distance separating two bars of either identical
or different colors limits the classification of this stimulus on a
same/different relational basis. We suggest that a strong bias in favor of a local
processing mode (in either the spatial or temporal domain) in nonhuman
animals might place important constraints on their communicative systems.
It reduces the possibility of establishing (temporally or spatially)
nonadjacent relations between or among the communicative signals, and between
or among the communicative signals and the objects in the real world to which
they refer, especially when they are far away or absent.


2.2 Integration of stimulus dimensions and sensory modalities


Animals, including humans, live in a rich world of information, and dealing 
with this complexity is probably critical for the survival of every species. 
The processing of this complexity may be achieved by a variety of cognitive 
mechanisms, which include, among others, the integration of the different
perceptual dimensions into single entities (i.e., integration of stimulus
dimensions), and the grouping of various exemplars of a given object into 
categories (e.g., object categorization; see a discussion of this issue below). 
Interestingly, these functions, which are of general adaptive value, are all 
critical for a multidimensional/multimodal system, such as language. For 
instance, speech comprehension requires that multiple prosodic (e.g.,
intonation, stress) and phonemic (e.g., voice onset time, place of articulation)
dimensions present in the acoustic signal be processed and integrated.
Language comprehension also relies on a multimodal/multisensory
mode of processing when auditory and visual information are
simultaneously integrated (e.g., for the mapping of the lip movements onto
the auditory signals).


2.2.1 Combining multiple stimulus dimensions 


Discrimination tasks are often proposed to animals as a means to assess
their perceptual abilities. Typically, stimuli are presented to the subjects,
and the subjects’ behavioral responses to some perceptual dimensions or
combinations of dimensions are reinforced, while responses to the
irrelevant dimensions are not. Evidence suggests that animals can
discriminate stimuli along various dimensions: color, shape, luminance, orientation
of the visual objects, or the pitch of auditory stimuli. Animals can as well
base their behavioral responses on combinations of two or more stimulus
dimensions. For instance, Cook (2001) showed that pigeons can learn to
select the computer screen area where horizontal green lines are presented, 
while avoiding the screen areas showing non-green lines and green lines in 
a non-horizontal orientation. 

In their review article, Lea and Wills (2008) comment on three main 
trends emerging from the literature on learned discrimination in nonhuman 
animals. The first one is that unidimensional discrimination is easier to learn 
than multidimensional discrimination based on combinations of features or 
conjunctions. Smith et al. (2012), for instance, trained monkeys to sort
sine-wave gratings depending either on their orientation alone or on both
orientation and spatial frequency considered in conjunction. Learning was much faster
in the unidimensional than in the bi-dimensional test condition. Another 
example of this effect comes from research on conceptual discrimination 
by monkeys (e.g., D’Amato and van Sant, 1988) and the demonstration 
that discrimination performance relies strongly on an analysis of features, 
such as color, rather than on configurations of features. Lea and Wills’
(2008) second conclusion is that when the stimuli are made of multiple
relevant dimensions, nonhuman animals express a tendency to focus their
attention on one dimension only, especially when this dimension has
sufficient discriminative value. There are also multiple examples of this trend
in the literature. In our laboratory, we found that baboons discriminated
computerized human faces considering exclusively the contour of the face
(Martin-Malivel and Fagot, 2001) or pixel luminance information
(Martin-Malivel et al., 2006), instead of the multiple levels of information (e.g.,
configural information, identity, etc.) that the facial stimuli may provide,
and that humans process. Finally, Lea and Wills (2008) report that even
with ingenious experimental designs, attempts to force nonhuman animals
to process multiple aspects of the stimuli mostly lead to failures. This can
be nicely illustrated by Dépy et al. (1997). The baboons in this study were
initially trained to discriminate between two categories of stimuli defined 
by the possession of any combination of two out of three possible binary 
features. Baboons could sort these two classes of stimuli to a good accuracy 
level, albeit after a long training process of several thousands of trials, but 
remained unable to take the three discriminative features into considera- 
tion to achieve this performance, two of the three features taking a leading 
role in the task. 

Wang et al. (2015) recently recorded the brain activity (fMRI) of rhesus
monkeys and humans in two test phases. In the first phase, subjects from
the two species passively listened to sequences of four tones, the last one
being either of a lower or higher pitch than the first three. After this
habituation procedure, the same subjects listened to sequences violating the general
structure used during the habituation phase. Thus, some test sequences con- 
tained a number of tones different from the habituation sequences (number 
deviant), some other sequences contained four tone units with identical 
pitch (sequence deviant), and a last set of sequences differed from the ha- 
bituation sequences regarding both the number of items and pitch. In both 
species, homologous brain areas were particularly responsive to violations 
in number (intraparietal and dorsal premotor areas) and sequence (ventral
prefrontal and basal ganglia), but humans were the only primates showing a
joint sensitivity to both factors in the perisylvian language region (bilateral
inferior frontal and superior temporal gyri). One limitation of this study
is that an absence of evidence in monkeys is not evidence of absence.
Although this study does not directly address the relationship between brain
and behavior, its results suggest that the perisylvian region is involved, in
humans only, in the integration of various stimulus dimensions contained
in auditory sequences of stimuli.


2.2.2 Combining multiple sensory modalities


The question of multimodal integration warrants a discussion in this sec- 
tion, due to its relevance to the origin of language. Evidence suggests that 
the ability to integrate information across sensory modalities in not at 
all restricted to humans. Cross-modal integration was demonstrated for 
instance using task requiring the processing of multi-sensory stimuli (Lanz 
et al., 2013), cross-modal interference tasks in baboons (Martin-Malivel 
and Fagot, 2001), and cross-modal matching tasks in chimpanzees (e.g., 
Davenport and Rogers, 1970). Unfortunately, we are aware of only one 
study in which the performance of humans and nonhuman animals were 
directly compared in cross modal tasks using the same stimulus material. 
This experiment from Fagot et al. (2000) requested the subjects to catego- 
rize pictures of humans and baboons in one condition, and human and 
baboon vocalizations in another one. In this experiment, the subjects of 
the two species perceived a prime prior to the presentation of the stimu- 
lus to be categorized. Depending on the condition, the prime could be a 
picture or a vocalization of baboons or humans and three conditions were 
tested: intra-modal visual-visual priming, inter-modal auditory-visual
priming, and inter-modal visual-auditory priming. Three subjects out of four in
each species demonstrated intra-modal priming. Inter-modal priming was
demonstrated in three out of four human subjects in the auditory-visual
condition, and all four in the visual-auditory condition, but it was only 
found in one baboon out of four in each intermodal condition, suggesting 
that inter-modal integration is more difficult in baboons than in humans. 
Given the small number of subjects involved in this study, a replication is 
warranted before drawing any firm conclusion on the evolution of
inter-modal integration.


2.3 Categorization and conceptual integration 


A central aspect of human cognition is our ability to form categories of 
various kinds of objects or mental entities. Categorization implies that the 
exemplars of each category are grouped into classes on the basis of physical or
more abstract properties. Categorization is a domain-general ability that is 
fundamental for a variety of more specialized functions (e.g., inference or 
decision making), including the language function considered in this book. 
Historically, Herrnstein and Loveland (1964) were the first to demonstrate
categorical abilities in nonhuman animals. They showed that pigeons
could sort pictures of humans and pictures devoid of humans into two
open-ended categories. Since this study, numerous papers have confirmed
that various animal species can efficiently categorize stimuli on the basis of
low-level perceptual stimulus dimensions, such as pitch for auditory
stimuli, and color, shape, size, motion, orientation or luminance for visual
stimuli (see for instance Berg and Grace, 2011, where pigeons were trained 
to categorize sine-wave disks considering their spatial frequency and ori- 
entation). However, the ability of nonhuman animals to apply categorical 
processes to more abstract — human-like — stimulus dimensions remains a 
matter of debate. 


2.3.1 Equivalence classes: grouping arbitrary items within the 
same category 


An important aspect of language is its arbitrariness. Arbitrariness corre- 
sponds to the fact that nothing in the physical form (acoustic properties) 
of most words refers to the objects they designate. For example, the word 
“car” does not “sound” like the vehicle it refers to. Therefore, words can 
refer to things in the real world, and things can refer to words, although 
there is no natural or necessary connection between them. In that case, 
words and objects are linked by a relation of equivalence, and the many ex- 
emplars of a given category of object (e.g., many different tables that vary in 
shape and color) can be categorized under a unique word label. The ability 
to form arbitrary connections between words and objects during ontogeny 
probably comes from the many co-occurrences of word-object pairs that
infants encounter during development. We presume that this ability relies 
on a more domain-general capacity to make associative (arbitrary) connec- 
tions between items, or categories of items. 

Experiments on “stimulus equivalence” have directly addressed the ca- 
pacity of human and nonhuman species to group arbitrary items into cat- 
egories, on the basis of their associative history (Sidman and Tailby, 1982). 
The prototypical design of stimulus equivalence experiments is shown in 
Figure 3. In stimulus equivalence experiments, the subject first learns a 
network of associations (shown in black in Figure 3) through repeated
exposure to these associations. Then, probe trials (shown by the red arrows
in Figure 3) test if the subject (1) can associate each stimulus to itself (e.g.,
associate A to A, reflexivity), (2) can reverse the trained relations
(e.g., associate B to A, symmetry), and (3) can associate stimuli that have
a common associate (e.g., associate A to C, transitivity). According to Sidman
and Tailby (1982), stimulus equivalence is fully shown if the subject dem- 
onstrates, without further training, the relations of reflexivity, symmetry 
and transitivity in post training trials. One may easily imagine the serious 
limits of a cognitive system that would fail in this task. The formation of
equivalence classes makes it possible to use stimuli interchangeably,
and is probably the cornerstone of complex symbolic thought.


Figure 3: Typical paradigm for experiments on stimulus equivalence. In this 
experimental design, the black arrows illustrate the trained associations. 
The dotted arrows illustrate the untrained associations that emerge in 
humans after an initial training phase (e.g., Sidman and Tailby, 1982). 
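
The logic of the probe trials can be made concrete with a minimal sketch. It assumes, for illustration only, that the trained associations are A to B and B to C; the stimulus labels and the function name `derived_relations` are hypothetical and are not taken from Sidman and Tailby (1982). The sketch simply lists the untrained relations that a subject must show to demonstrate stimulus equivalence.

```python
def derived_relations(trained):
    """Return the untrained relations tested by probe trials.

    `trained` is a set of (sample, comparison) pairs learned during training.
    """
    stimuli = {s for pair in trained for s in pair}
    # Reflexivity: each stimulus is matched to itself (A to A).
    reflexivity = {(s, s) for s in stimuli}
    # Symmetry: each trained relation is reversed (B to A).
    symmetry = {(b, a) for (a, b) in trained}
    # Transitivity: stimuli sharing a common associate are linked (A to C).
    transitivity = {
        (a, d)
        for (a, b) in trained
        for (c, d) in trained
        if b == c and a != d and (a, d) not in trained
    }
    return reflexivity, symmetry, transitivity


# Assumed training set (hypothetical): A -> B and B -> C.
trained = {("A", "B"), ("B", "C")}
reflexivity, symmetry, transitivity = derived_relations(trained)
print(sorted(symmetry))      # [('B', 'A'), ('C', 'B')]
print(sorted(transitivity))  # [('A', 'C')]
```

In this scheme, a subject that masters the `trained` pairs but fails on the `symmetry` set would not meet the criterion for stimulus equivalence.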


Equivalence relations emerge early in human infancy (at 23 months of age;
Lipkens et al., 1993). By contrast, the formation of equivalence classes seems
especially difficult for nonhuman animals such as pigeons or monkeys, and 
among the three relations sustaining stimulus equivalence, the relation of 
symmetry seems the most difficult one to acquire. Lionello-DeNolf (2009) 
reviewed a total of 24 articles on symmetry testing, and found that the vast 
majority of these articles reported negative results. Moreover, the handful 
of articles reporting more positive findings all used a very small number 
of subjects who received special training procedures, such as long training 
combining symmetry and reflexivity trials (Frank and Wasserman, 2005) 
or forms of symbolic training (Pepperberg, 2006). 

A recent paper from our research group nicely illustrates that difficulty 
(Medam et al., 2016). We trained baboons to associate pictures of bears and 
pictures of cars to two different shapes which served as category labels. The
baboons demonstrated category learning in this task, but failed to respond 
correctly when asked to reverse the trained relation and to associate the 
category labels to the pictures of cars and bears. The immediate conclu- 
sion from this study and others (see Lionello-DeNolf, 2009) is that the 
preferred mode of processing in nonhuman animals is not favorable to the 
spontaneous emergence of symmetrical relations, which is fundamental for 
the emergence of equivalence classes. Arbitrariness, as defined by linguists,
requires arbitrary, bidirectional associations between words and their
referents (e.g., the written word “CAT” refers to the animal, just as the
animal refers to the word “CAT”). Evidence suggests that nonhuman animals
too can form arbitrary associations between various items with no logical
connections between them, but they apparently have great difficulty
processing these associations as bidirectional.


2.3.2 First order relations 


Nonhuman animals can master a broad range of discrimination tasks, and 
some of them involve the processing of first-order relations. First-order rela- 
tions refer to spatial or more abstract relations among objects, such as the 
fact that an object is above or below another one, or that two objects have 
the same functions. Particularly important for our linguistic system is the 
first-order relation of sameness/differentness. Our language makes great use 
of categories, and the abstract concept of sameness is essential for the develop- 
ment of verbal categories. The concept of sameness can provide the basis for 
the most complex cognitive operations, such as the conservation of volumes 
or areas, or analogical reasoning (see below). Without this concept, we would 
be unable to understand sentences such as “This is a cat!”, and to get from 
this sentence the idea that the animal we see belongs to the cat category. We 
would as well be unable to understand sentences such as “It is warm again”, 
suggesting a similarity between the current and past weather. Comparative 
psychologists have shown that a variety of nonhuman animal species, such as 
the chimpanzee (Premack, 1983), the baboon (Wasserman et al., 2001) and 
the rat (Wasserman et al., 2012), succeed in learning same-different relational 
tasks. However, the nature of the mechanisms supporting this competence in 
nonhuman animals, and their similarity with humans, remains unclear. 
Wright and Katz (2006) asked the following question: how much training
do nonhuman animals need to form the same/different concept? To answer this
question, they tested pigeons, capuchin monkeys and rhesus monkeys with 
the same test design. Animals from the three species saw two pictures in suc- 
cession on a touch screen, and two kinds of trials were distinguished. In the 
“same” trials, the second picture was identical to the first one, while it was 
different from the first one in the “different” trials. In both kinds of trials, 
a white key was always displayed on the right of the second picture. When 
the first and second pictures were identical (same trials), then the subject
was asked to touch/peck the second picture to obtain a reward. If the second 
picture was different from the first one (different trials), then a touch/peck at 
the white response key was considered correct. The monkeys needed far
fewer items (about 32) to develop the concept of sameness than did the pigeons
(256). The number of trials children would need in this task is not known, but
studies have shown that children can categorize cats as different from dogs 
with only 12 training exemplars (Quinn et al., 1993), and by 10 months of 
age, they can form categories with only 7 or 8 training exemplars (Younger 
and Cohen, 1986). The data therefore suggest an evolutionary trend in this 
ability: humans would require exposure to a smaller number of items than 
the other animals, to form categories and develop same/different concepts. 

The concept of sameness can be applied to a broad range of attributes, 
from the most perceptual to the most abstract ones, and another interesting 
issue in the comparative literature is to know if animals use the same kind 
of information as humans, when solving similar tasks requiring an abstract 
concept of sameness/differentness. This kind of question has been addressed
extensively by Wasserman and collaborators (see review in Wasserman et al., 
in press; Wasserman et al., 2004). These authors trained pigeons, baboons and 
humans to categorize displays resembling those at the top of Figure 4. The
displays consisted of arrays of 16 icons which were either all the same (same relation) or
all different (different relation). After they received this category training, the 
subjects were tested with arrays containing mixtures of icons, in which some 
icons were duplicated a number of times in the array. The authors reasoned
that if the subjects have formed the concept of sameness, then they should 
classify the arrays containing at least one item different from the others as 
“different arrays”, irrespective of the fact that some icons are repeated.
Figure 4 illustrates the most substantial findings of this set of experiments. This
figure indicates the percentage of “different” responses to the mixtures, as a
function of the entropy of the stimulus. Entropy in this experiment should be
understood as a quantification of the perceptual variability of the array: the
all-same arrays have an entropy of 0, and the all-different arrays have the
highest possible entropy value of 4 (mixtures have intermediate entropy values).
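
To make the entropy measure concrete, the following sketch computes it for a few arrays. It assumes that entropy here is the Shannon entropy, in bits, of the proportions of each icon type in the array; this assumption is consistent with the values of 0 and 4 given above for 16-icon arrays, but the function name `array_entropy` and the example mixture are illustrative only.

```python
from collections import Counter
from math import log2


def array_entropy(icons):
    """Shannon entropy (in bits) of the icon types in a display."""
    n = len(icons)
    counts = Counter(icons)
    # Sum of -p * log2(p) over the proportion p of each icon type.
    return sum(-(c / n) * log2(c / n) for c in counts.values())


all_same = ["a"] * 16                        # 16 copies of one icon
all_different = [str(i) for i in range(16)]  # 16 distinct icons
mixture = ["a"] * 8 + ["b"] * 8              # hypothetical mixture: two icons, 8 each

print(array_entropy(all_same))       # 0.0
print(array_entropy(all_different))  # 4.0
print(array_entropy(mixture))        # 1.0
```

Under this reading, a subject whose responses track entropy would respond "different" more often as the value rises, whereas a subject applying an abstract rule would respond "different" to any array whose entropy is above 0.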


Figure 4: Use of perceptual (entropy) cues by pigeons, monkeys and humans, 
in a same/different discrimination task. The top panel shows the kind 
of displays employed in Wasserman and collaborators’ experiments 
(Wasserman et al., 2004, left: same array, right: different array). The 
bottom panel shows that the same-different response of monkeys and
pigeons is controlled in this task by the entropy of the arrays while
80% of the humans used more abstract cues in this task. Figure
adapted from Wasserman et al. (2004). 
The bottom part of Figure 4 shows that the behaviour of the baboons and
the pigeons was controlled by the entropy of the arrays, which is a perceptual
cue. Responses from a subset of humans (about 20% of the group) were
also largely controlled by the entropy of the arrays, but this constraint was
released in most of the subjects (80%), who treated the arrays containing
at least one item different from the others as illustrations of the “different” 
concept. It can therefore be concluded that humans expressed more abstract 
judgments than pigeons and baboons in this task. We will not present in 
this chapter the full series of experiments using this kind of stimuli with 
animals (for a recent review, see Wasserman et al., in press), but the reader 
should be aware that clear demonstrations also exist that monkeys can
base their same/different responses on abstract cues independently of
the entropy of the stimuli (Flemming et al., 2013). Nevertheless, although
several animal species seem capable of abstract same/different judgements, 
humans, more readily than other animals, apply qualitative, rule-based
frameworks to the same/different discrimination task.


2.3.3 Analogical (second-order) relational processing 


Our linguistic systems make great use of analogies, and our capacity to 
produce and understand analogies is considered by many as “the Fuel
and Fire of Thinking” (Hofstadter and Sander, 2013). Developmental studies
have shown that analogical reasoning is facilitated in children by the ca- 
pacity to represent abstract relations in symbolic terms via linguistic labels 
(Christie and Gentner, 2014). 

Most research on analogical reasoning in animals has used the Relational 
Matching-to-Sample task (RMTS: e.g., Fagot and Thompson, 2011) il- 
lustrated in Figure 5. In this task, the subject first perceives one pair of ob- 
jects which are either identical or different. Two comparison pairs are then 
presented, and the subject must indicate the stimulus pair exemplifying the 
same (same or different) relation as the sample pair. In other words, the task 
can be conceptualized as “if AA then BB, and if AB then CD”. Researchers 
in the domain of comparative cognition have tested several animal species, 
including pigeons, monkeys and apes using the RMTS task (for a review, 
see Wasserman, Castro and Fagot, in press). Most of these attempts failed 
(Thompson and Oden, 2000), but a handful of studies also provide more 
positive results, in particular in tests involving thousands of training trials
(e.g., Fagot and Thompson, 2011). 


Figure 5: Illustration of the relational matching task used, in baboons, with color 
(left part of the figure) and shape (right) stimuli. 
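
The rule that defines a correct response in the RMTS task can be sketched as follows. This is an illustrative formalization only: the function names `relation` and `correct_choice` and the letter-coded stimuli are assumptions, and the sketch says nothing about how animals actually solve the task.

```python
def relation(pair):
    """Return the first-order relation instantiated by a pair of items."""
    first, second = pair
    return "same" if first == second else "different"


def correct_choice(sample, comparisons):
    """Return the comparison pair showing the same relation as the sample."""
    target = relation(sample)
    matches = [pair for pair in comparisons if relation(pair) == target]
    if len(matches) != 1:
        raise ValueError("exactly one comparison pair must match the sample")
    return matches[0]


# Hypothetical comparison pairs: one "same" pair and one "different" pair.
comparisons = [("X", "X"), ("Y", "Z")]
print(correct_choice(("A", "A"), comparisons))  # ('X', 'X')
print(correct_choice(("A", "B"), comparisons))  # ('Y', 'Z')
```

In this example the correct pair shares no item with the sample, so matching physical features cannot identify it; only the relation between relations is diagnostic.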


In our laboratory, we demonstrated that baboons can successfully
solve the RMTS task with pairs of color patches as stimuli (Fagot and
Parron, 2010, see Figure 5). In a different study (Fagot and Thompson, 2011),
we further demonstrated that baboons can also solve this RMTS task
on the basis of the shape of the items (Figure 5). Again, this cognitive feat
required an extensive training period (from 17 to 30 000 trials per subject). 
In both studies, the different generalization tests confirmed the real abstract 
nature of the processes at work in these two tasks. For instance, the baboons 
could continue to solve the task with a high level of performance when 
we used novel colors (Fagot and Parron, 2010), and novel shapes (Fagot 
and Thompson, 2011) as stimuli. However, although cognitive flexibility is 
suggested by these findings, the data also suggest limits in this processing. 
In Fagot and Parron’s (2010) study, the color task was in fact given to 6
baboons, and 4 of these 6 subjects eventually learned the task. In Fagot and 
Thompson’s (2011) study, the same task was given one year later to a larger 
group of subjects, including the 6 already tested in Fagot and Parron (2010). 
Six out of 29 baboons learned the RMTS task with shapes, but importantly, 
none of the baboons who had initially learned the task with colors could 
also learn it with shapes. In other words, learning the RMTS task with color
cues did not help the subjects at all to learn it with shape cues.
Generalization across domains
is central to analogical reasoning in humans, and is probably what makes 
human reasoning so flexible. Generalization across domains allows us, for 
instance, to understand the meaning of a sentence like “atoms are like tiny 
solar systems” or “life is a gift, a chocolate box”. Baboons, and probably
other nonhuman primates as well, are quite flexible in processing items and
their relations within given domains, those they have been trained with
(e.g., color), but are clearly not as skilled as humans at generalizing across
domains. This, we believe, is another factor that may greatly affect nonhuman
animals’ potential for developing elaborate forms of language.


3. Summary and Conclusions 


The main goal of the chapter was to examine the origin of human language 
from the standpoint of comparative psychology. Language in its various 
forms (e.g., gesture, writing, speech) is a multi-level integrative process 
that requires, at the perceptual stage, the segmentation and grouping of 
perceptual information to extract the general meanings of the communica- 
tive signals. We have argued above that many of the integrative processes 
involved in the language function are in fact domain-general processes that 
can also be found in non-linguistic functions. 

Considering that language uses a multitude of domain-general functions, 
examination of these functions in animals and especially in nonhuman pri- 
mates should provide important information on the cognitive background 
that made it possible for language to emerge in our evolutionary history. 
Following this reasoning, we comparatively examined in this chapter a 
number of domain-general cognitive functions, which imply various forms 
of integration of perceptual/conceptual information. Among the considered 
functions, we examined the ability of animals to integrate information in 
time and space, to combine stimulus dimensions, to group objects into cat- 
egories, and to develop conceptual/relational processes (first- and second- 
order concepts). Of course, this list is not exhaustive but we believe that it 
represents a significant selection of basic cognitive domain-general processes 
that serve language perception and comprehension. 

The present overview of the literature allowed us to reach two main 
conclusions. First, we have identified clear-cut demonstrations that
nonhuman animals are capable of grouping information in time and space,
can combine stimulus dimensions, and can form categories at different 
levels of abstractness. Thus, we suggest that these functions are shared by 
humans and nonhumans, at least to some extent, and that they are not 
language specific (i.e., they have a long phylogenetic history). Second, the 
literature also reveals important differences in performance among nonhu- 
man animals, and between nonhuman animals and humans. For example, 
evidence suggests that, compared to humans, in nonhuman animals the
integration of information in time and space is more “local”, and the capacity
to integrate the information on a larger scale is more restricted. When
we come to consider how animals integrate various stimulus dimensions,
experimental evidence suggests that they tend to focus on some particular
physical dimensions of the stimuli more than humans do, and they hardly 
combine information from different stimulus dimensions. When we come 
to consider more general categorization processes, it appears that nonhu- 
man primates form categories but their categories seem to be more strongly 
tied to the perceptual input than those of humans, and abstract processes, 
when they emerge, need many more trials to develop, and/or do not
generalize to untrained dimensions as readily as in humans. In other words,
for most of the integrative functions we have considered in this chapter,
nonhuman animals show behaviors that differ at least quantitatively, if not
qualitatively, from human behavior, and we propose that these differences
might be the bottlenecks for the evolution of language. Obviously, evolu- 
tion has no direction and mastering some human-like language is not and 
has never been an issue in nonhuman animals. However, it might be that a 
particular pattern of development of these domain-general functions was 
a prerequisite to the emergence of human language, and that favorable 
conjunctions occurred only once in the phylogeny of the primate group, 
giving rise to human language. The idea that some domain-general
nonlinguistic functions form the bottlenecks of language evolution will be 
further developed below. 

To account for the evolution of language, many theorists have focused 
their attention on language specific functions, which were considered as 
key factors for the evolution of human language. For example, many
traditional theories claimed that only humans have a low larynx (which has
since been disconfirmed), or that only humans have the ability to understand
and produce recursive structures allowing an infinite variability in language
production. Here, we do not want to discount these explanations, but we 
think that they do not address the problem at its roots. At this point in the
scientific endeavor, we think that it is now necessary to step back a bit to 
enlarge our view of the problem. Doing so, we can imagine two different 
scenarios on the origin of language. 

The “language-first” scenario would be that our primate/prehominid
ancestors had rather limited cognitive resources and it was the appearance 
of language that boosted their general cognitive capacities. Although there 
is no doubt that language is a booster for cognitive functions, we think that 
this scenario does not hold. To illustrate our rebuttal of this theory, imagine 
an animal cognitive system with a limited working memory span of N=1 
item. How could a language system develop with such limited memory re- 
sources? A clear expansion of this memory system would be required before 
the animal would have a chance to develop a proto-language system. This 
very extreme example shows that language evolution requires a cognitive 
background to support it, and it is only when such a cognitive background 
has evolved — potentially as a response to ecological pressures — that a form 
of proto-language can have a chance to emerge. 

The second possible scenario, called “prerequisites-first”, is that language 
only emerged in our evolutionary history once critical domain-general func- 
tions had gained in cognitive power in our animal/pre-hominid ancestors. 
This scenario raises one major question: which domain general function(s), 
or combination of functions, must have evolved first for the emergence of
language? We have no clear answer to this question but can provide several 
lines of thinking on this issue. First, we note that studies focusing on the 
so-called low-level perceptual mechanisms showed important differences
between platyrrhine and catarrhine species, but very few differences
are observed in perceptual functions among monkeys, apes and humans
(see for instance Fobes and King, 1982 for a review of visual perception). 
This mere fact suggests that the evolution of these perceptual functions is 
probably not the factor that made the difference and triggered the evolution 
of speech in humans. Secondly, comparative experiments suggest more per- 
vasive differences between humans and the other primates in two domains 
at least. The first domain of importance is the domain of working memory. 
Working memory in humans seems to depart from that of other animal 
species in several important aspects, for instance regarding the ability to
process large amounts of information in parallel (e.g., Fagot and De Lillo,
2011), or to process long-distance dependencies (e.g., Wilson et al., 2015). 
Although not discussed in the context of this chapter, working memory in 
humans may also qualitatively depart from that of the other animals in its 
use of a phonological loop facilitating memorization in the short term. The 
second domain for which strong differences emerge between humans and 
the other animals is the domain of attention. Nonhuman animals seem to 
focus on single stimulus dimensions more than humans do, and tend to 
have a more local mode of processing of the perceptual input than humans. 
There remains a debate on whether the increase in cognitive functions fol- 
lowed the phylogenetic order, from the remote prosimian species to the ape 
species phylogenetically closest to humans (e.g., Reader et al., 2011), or 
whether variations in cognitive power among the different primate species 
have occurred at multiple times in the course of evolution, in independent 
unrelated primate groups, for instance under the pressure of social factors
such as the complexity of the social network (Dunbar, 1998). Discussing 
these hypotheses is beyond the scope of the current chapter. However,
whatever the source of this increment in cognitive power is, we propose that the
language ability appeared in the evolution of primates at a point in time 
where domain-general cognitive capacities, especially those pertaining to 
attention and working memory, converged and were sufficiently developed
to permit its evolution. This idea is in line with recent usage-based theories 
suggesting that language could be acquired in humans by means of
domain-general, evolutionarily old processes (Bybee, 2010; Tomasello, 2005).

From a more practical standpoint, we conclude from this chapter that 
the comparative investigation of non-linguistic, domain-general functions
should be considered of central importance in the debate on the evolution 
of language. Unfortunately, real comparative studies, in which humans and 
other species are tested on the same problems using the same tasks, are rela- 
tively rare in this literature, and most of them only concern a very limited 
number of species. Such studies will become mandatory to further test the 
hypothesis that the expansion of domain-general functions in nonhuman 
primates served as a basis for the evolution of language. 